Reference Genome Analysis
A set of analyses was run on 178 annotated microbial reference genomes, as described in A catalog of reference genomes from the human microbiome. Here we present figures and downloadable datasets resulting from these analyses. Where possible, analyses will be rerun periodically as additional annotated reference genomes are submitted to NCBI, and updated datasets will be added.
Assembly Metrics
The majority of the Microbial Reference Genomes will be sequenced only to a high-quality draft stage. (More information on finishing levels). The following provisional set of metrics has been developed to ensure that all genomes released through the HMP meet or exceed this quality level.
HMP Provisional Draft Assembly Metrics
- (1) >90% of the genome included in contigs (≥ 500bp) as an indicator of completeness
- (2) >90% of bases at greater than 5x read coverage to provide assurance of high base quality in the consensus sequence
- (3) >5 kb contig N50 length to ensure long enough contiguous sequences so that most genes are intact
- (4) >20 kb scaffold N50 length to ensure long enough scaffolds to capture large operons
- (5) Average contig length >5 kb to provide uniformity throughout the assembly, i.e., the assembly is not a few large contigs and many small ones
- (6) >90% of "core genes" present in the gene list, to ensure completeness. For more information on core genes, see the Bacterial and Archaeal core genes tool protocols
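Criteria (3) and (5) depend only on the list of contig lengths, so they can be illustrated with a short calculation. The following is a minimal Python sketch, not the DACC pipeline itself: the function names `nx_length` and `check_contig_criteria` and the example contig lengths are hypothetical, and the sketch assumes lengths are given in base pairs with the 5 kb thresholds taken as 5,000 bp.

```python
from typing import Dict, Sequence


def nx_length(lengths: Sequence[int], percent: int = 50) -> int:
    """Return the Nx length (N50 when percent=50).

    Nx is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least `percent` percent
    of the total assembly length.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the boundary.
        if running * 100 >= percent * total:
            return length
    return min(lengths)  # not reached for valid input


def check_contig_criteria(contig_lengths: Sequence[int],
                          threshold_bp: int = 5000) -> Dict[str, object]:
    """Evaluate criteria (3) and (5) for one assembly (assumed thresholds in bp)."""
    n50 = nx_length(contig_lengths, 50)
    mean_length = sum(contig_lengths) / len(contig_lengths)
    return {
        "contig_n50": n50,
        "mean_contig_length": mean_length,
        "passes_n50": n50 > threshold_bp,           # criterion (3)
        "passes_mean_length": mean_length > threshold_bp,  # criterion (5)
    }


# Hypothetical contig lengths (bp) for a single assembly.
example_contigs = [80_000, 40_000, 20_000, 10_000, 6_000, 4_000]
result = check_contig_criteria(example_contigs)
print(result)
print("N75:", nx_length(example_contigs, 75), "N90:", nx_length(example_contigs, 90))
```

The same `nx_length` function applies to scaffold lengths for criterion (4), with a 20 kb threshold in place of 5 kb.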
The DACC has developed a pipeline for assessing these criteria, which is routinely run on all annotated HMP Reference Genomes available from NCBI. Below are the metrics assessments for the original set of 178 genomes available for analysis. This genome dataset is available for download here. (This is a large file; the download will take time.)
Table 1. Provisional Draft Assembly Metrics, organized by finishing status
| Metric | Draft: Pass % | Draft: Mean | Draft: Range | Improved: Pass % | Improved: Mean | Improved: Range |
| --- | --- | --- | --- | --- | --- | --- |
| Number of Strains | 133 | | | 45 | | |
| (1) > 90% of the genome included in contigs* | 100% | 98.23% | 95.1-99.9% | 100% | 99.91% | 98.6-100% |
| (2) >90% of the bases greater than 5x read coverage# | 99% | 98.90% | 80.8-100% | 100% | 99.35% | 98.8-99.6% |
| (3) > 5 KB contig N50 | 100% | 102.61KB | 11.12-861.67KB | 100% | 517.92KB | 58.03-3472.99KB |
| N75 | 99% | 54.82KB | 4.97-556.76KB | 100% | 340.20KB | 30.56-2635.77KB |
| N90 | 90% | 25.54KB | 2.01-240.69KB | 100% | 211.51KB | 14.96-2635.77KB |
| (4) > 20 KB scaffold N50* | 100% | 883.93KB | 50.56-3356.77KB | 100% | 606.77KB | 91.71-2898.42KB |
| N75* | 100% | 511.35KB | 24.31-3237.97KB | 100% | 378.22KB | 52.32-2391.23KB |
| N90* | 99% | 282.14KB | 11.74-2490.47KB | 100% | 226.24KB | 28.67-2391.23KB |
| (5) Average contig length > 5 KB | 100% | 31.52KB | 5.62-180.70KB | 100% | 174.70KB | 23.26-1321.04KB |
| (6) > 90% of core genes present in gene list | 99% | 99.63% | 86.4-100% | 100% | 99.90% | 98.5-100% |
*Calculated only for strains with scaffold assemblies: Draft, n=74; Improved, n=37
#Per base coverage was not available for all reads: Draft, n=121; Improved, n=4
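Each row of Table 1 reports a pass percentage, a mean, and a range for the strains in one finishing category. A minimal Python sketch of that kind of summary follows, assuming that "Pass %" is the share of strains whose value exceeds the criterion threshold. The function name `summarize_metric` and the input values are hypothetical and are not taken from the HMP dataset.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_metric(values: Sequence[float], threshold: float) -> Dict[str, object]:
    """Summarize one metric across the strains of one finishing category.

    Assumes a strain passes when its value is strictly greater than
    the threshold, matching the ">" wording of the criteria.
    """
    if not values:
        raise ValueError("no values given")
    passed = sum(1 for v in values if v > threshold)
    return {
        "pass_pct": 100.0 * passed / len(values),
        "mean": mean(values),
        "range": (min(values), max(values)),
    }


# Hypothetical contig N50 values in KB for four strains, threshold 5 KB.
hypothetical_n50_kb = [11.0, 4.0, 25.0, 60.0]
summary = summarize_metric(hypothetical_n50_kb, threshold=5.0)
print(summary)
```

Metrics marked with * or # in the table would be summarized over only the strains for which the underlying data exist, so the list passed in would be shorter than the full category.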
The Draft category corresponds to high-quality draft sequences. The Improved category corresponds to Improved High-Quality draft submissions. (More information on finishing levels). No reference genomes had been improved beyond this point at the time that this analysis was performed.
