Reference Genome Analysis

A set of analyses was run on 178 annotated microbial reference genomes, as described in "A catalog of reference genomes from the human microbiome". Here we present figures and downloadable datasets resulting from these analyses. Where possible, the analyses will be rerun periodically as additional annotated reference genomes are submitted to NCBI, and updated datasets will be added.


Assembly Metrics

The majority of the Microbial Reference Genomes will be sequenced only to a high-quality draft stage (see More information on finishing levels). The following provisional set of metrics has been developed to ensure that all genomes released through the HMP meet or exceed this quality level.


HMP Provisional Draft Assembly Metrics

  • (1) >90% of the genome included in contigs (≥ 500 bp) as an indicator of completeness
  • (2) >90% of bases at greater than 5x read coverage to provide assurance of high base quality in the consensus sequence
  • (3) >5 kb contig N50 length to ensure long enough contiguous sequences so that most genes are intact
  • (4) >20 kb scaffold N50 length to ensure long enough scaffolds to capture large operons
  • (5) Average contig length >5 kb to provide uniformity throughout the assembly, i.e., the assembly is not a few large contigs and many small ones
  • (6) >90% of "core genes" present in the gene list, to ensure completeness. For more information on core genes, see the Bacterial and Archaeal core genes tool protocols.
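Criteria (3) to (5) depend on length statistics that can be computed directly from a list of contig or scaffold lengths. The sketch below is illustrative only and is not the DACC pipeline: the function name `nx_length`, the example lengths, and the reading of 5 kb as 5,000 bp are assumptions. It uses the common definition of Nx as the length L such that sequences of length L or longer contain at least x% of the assembled bases.

```python
from statistics import mean


def nx_length(lengths, fraction=0.5):
    """Return the Nx length for a list of sequence lengths (in bp).

    With fraction=0.5 this is the N50: the largest length L such that
    sequences of length >= L together hold at least half of all bases.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for length in ordered:
        running += length
        if running >= target:
            return length
    return ordered[-1]  # fallback; not reached for 0 < fraction <= 1


# Hypothetical contig lengths in bp (made-up values, not HMP data).
contig_lengths = [100_000, 80_000, 60_000, 40_000, 20_000]

contig_n50 = nx_length(contig_lengths, 0.50)
contig_n75 = nx_length(contig_lengths, 0.75)
contig_n90 = nx_length(contig_lengths, 0.90)
average_contig = mean(contig_lengths)

# Criteria (3) and (5), assuming 5 kb means 5,000 bp.
passes_contig_n50 = contig_n50 > 5_000
passes_average_contig = average_contig > 5_000
```

The same function applies to scaffold lengths for criterion (4), with a threshold of 20 kb.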

The DACC has developed a pipeline for assessing these criteria, which is routinely run on all annotated HMP Reference Genomes available from NCBI. Below are the metrics assessments for the original set of 178 genomes available for analysis. This genome dataset is available for download here. (This is a large file; the download will take time.)
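As a rough illustration of what an assessment of the six criteria might look like for a single genome, here is a minimal sketch. It is not the DACC pipeline: the `AssemblyStats` class, its field names, and the example values are hypothetical, and 1 kb is assumed to be 1,000 bp. Values that cannot be computed for a strain (for example, scaffold statistics for an assembly without scaffolds) are represented as `None` and reported as not assessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AssemblyStats:
    """Per-genome inputs to the six provisional metrics (hypothetical layout)."""
    pct_genome_in_contigs: Optional[float]  # (1), percent
    pct_bases_over_5x: Optional[float]      # (2), percent
    contig_n50_bp: float                    # (3), base pairs
    scaffold_n50_bp: Optional[float]        # (4), base pairs
    mean_contig_bp: float                   # (5), base pairs
    pct_core_genes: float                   # (6), percent


def exceeds(value, threshold):
    """Return True/False for a strict '>' test, or None if the value is missing."""
    return None if value is None else value > threshold


def assess(stats):
    """Map each criterion label to True (pass), False (fail), or None (not assessed)."""
    return {
        "(1) >90% of genome in contigs": exceeds(stats.pct_genome_in_contigs, 90.0),
        "(2) >90% of bases at >5x coverage": exceeds(stats.pct_bases_over_5x, 90.0),
        "(3) contig N50 >5 kb": exceeds(stats.contig_n50_bp, 5_000),
        "(4) scaffold N50 >20 kb": exceeds(stats.scaffold_n50_bp, 20_000),
        "(5) average contig length >5 kb": exceeds(stats.mean_contig_bp, 5_000),
        "(6) >90% of core genes present": exceeds(stats.pct_core_genes, 90.0),
    }


# Made-up example: coverage data unavailable, everything else above threshold.
example = AssemblyStats(
    pct_genome_in_contigs=98.2,
    pct_bases_over_5x=None,
    contig_n50_bp=102_610,
    scaffold_n50_bp=883_930,
    mean_contig_bp=31_520,
    pct_core_genes=99.6,
)
results = assess(example)
for label, outcome in results.items():
    status = "not assessed" if outcome is None else ("pass" if outcome else "fail")
    print(f"{label}: {status}")
```

Keeping "not assessed" distinct from "fail" matters here because some metrics are only available for a subset of strains.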


Table 1. Provisional Draft Assembly Metrics, organized by finishing status

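Number of strains: Draft, 133; Improved, 45

| Metric                                         | Draft Pass % | Draft Mean | Draft Range      | Improved Pass % | Improved Mean | Improved Range    |
|------------------------------------------------|--------------|------------|------------------|-----------------|---------------|-------------------|
| (1) >90% of the genome included in contigs*    | 100%         | 98.23%     | 95.1-99.9%       | 100%            | 99.91%        | 98.6-100%         |
| (2) >90% of bases at greater than 5x coverage# | 99%          | 98.90%     | 80.8-100%        | 100%            | 99.35%        | 98.8-99.6%        |
| (3) >5 kb contig N50                           | 100%         | 102.61 kb  | 11.12-861.67 kb  | 100%            | 517.92 kb     | 58.03-3472.99 kb  |
|     contig N75                                 | 99%          | 54.82 kb   | 4.97-556.76 kb   | 100%            | 340.20 kb     | 30.56-2635.77 kb  |
|     contig N90                                 | 90%          | 25.54 kb   | 2.01-240.69 kb   | 100%            | 211.51 kb     | 14.96-2635.77 kb  |
| (4) >20 kb scaffold N50*                       | 100%         | 883.93 kb  | 50.56-3356.77 kb | 100%            | 606.77 kb     | 91.71-2898.42 kb  |
|     scaffold N75*                              | 100%         | 511.35 kb  | 24.31-3237.97 kb | 100%            | 378.22 kb     | 52.32-2391.23 kb  |
|     scaffold N90*                              | 99%          | 282.14 kb  | 11.74-2490.47 kb | 100%            | 226.24 kb     | 28.67-2391.23 kb  |
| (5) Average contig length >5 kb                | 100%         | 31.52 kb   | 5.62-180.70 kb   | 100%            | 174.70 kb     | 23.26-1321.04 kb  |
| (6) >90% of core genes present in gene list    | 99%          | 99.63%     | 86.4-100%        | 100%            | 99.90%        | 98.5-100%         |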

*Calculated only for strains with scaffold assemblies: Draft, n=74; Improved, n=37
#Per base coverage was not available for all reads: Draft, n=121; Improved, n=4
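The Pass %, Mean, and Range columns of Table 1 are summaries over per-strain values, computed only for the strains where a metric is available (as the footnotes indicate). A minimal sketch of such a summary follows; the function name `summarize`, the use of `None` for unavailable values, and the example numbers are assumptions made for illustration, not the method or data behind the table.

```python
def summarize(values, threshold):
    """Summarize one metric across strains.

    `values` holds one number per strain, or None where the metric
    is unavailable for that strain; missing entries are excluded.
    A strain passes when its value is strictly greater than `threshold`.
    """
    available = [v for v in values if v is not None]
    if not available:
        raise ValueError("metric unavailable for every strain")
    passed = sum(1 for v in available if v > threshold)
    return {
        "n": len(available),
        "pass_pct": 100.0 * passed / len(available),
        "mean": sum(available) / len(available),
        "range": (min(available), max(available)),
    }


# Made-up percentages of core genes present for five strains;
# one strain has no value and is left out of the summary.
core_gene_pct = [100.0, 99.5, 86.4, None, 98.1]
summary = summarize(core_gene_pct, threshold=90.0)
print(summary)
```

Reporting `n` alongside the summary makes it clear when a row is based on fewer strains than the full category.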


The Draft category corresponds to high-quality draft sequences. The Improved category corresponds to improved high-quality draft submissions (see More information on finishing levels). No reference genomes had been improved beyond this point at the time this analysis was performed.