Finishing Levels

The majority of the Microbial Reference Genomes will be sequenced only to a high-quality draft stage (1). High-quality draft sequences do not include every base of the genome; rather, they are assemblies of several large contiguous pieces of sequence (contigs) separated by gaps where the sequence remains unknown. Nonetheless, they provide enough information for a general assessment of gene content.
About 15% of the Microbial Reference Genomes will be taken to improved levels of finishing, up to and including a finished (often referred to as Gold Standard) state. A set of community-defined categories of finishing standards has recently been established (1). The HMP reference genome consortium has adopted these standards, adding details specific to their application to HMP Reference Genomes.
The target finishing standard for each Reference Genome project can be found in the HMP Project Catalog.


Finishing Standards - Provisional HMP Consortium Definitions

The genome standards defined here are aggregated from the following references:

For more information regarding these standards, please review the complete texts.

Standard Draft: This is the minimum standard for a submission to the public databases. It may contain unfiltered data from any number of different sequencing platforms, assembled into contigs. The sequence may be of poor quality and can be relatively incomplete. It may not always be possible to remove contaminating sequence data. Standard Draft is the least expensive to produce and still provides useful information.

High-Quality Draft: This is a draft assembly with little or no manual review of the assembly. Overall coverage should represent at least 90% of the genome or target region. Efforts should be made to exclude contaminating sequences. Sequence errors and misassemblies are possible, and no order or orientation of contigs is implied. This standard is appropriate for general assessment of gene content. High-Quality Draft genome standards, with their rationales, are as follows:

  • (1) >90% of the genome included in contigs (≥ 500 bp) so as to ensure completeness for identification of source species for metagenomic sequences. Genome size is estimated as the sum of contigs for fragment assemblies without read pairs, or the sum of scaffold spans for assemblies with read pairs.
  • (2) >90% of bases at greater than 5x read coverage to provide assurance of high base quality in the consensus sequence (this allows base quality to be assessed independently of sequencing platform, as the use of quality values may be inconsistent between platforms)
  • (3) >5 kb contig N50 length to ensure long enough contiguous sequences so that most genes are intact
  • (4) >20 kb scaffold N50 length to ensure long enough scaffolds to capture large operons
  • (5) Average contig length >5 kb to provide uniformity throughout the assembly, i.e., the assembly is not a few large contigs and many small ones
  • (6) >90% of "core genes" present in the gene list, to ensure completeness. The core genes comprise single copy genes conserved among all sequenced genomes in the superkingdom Bacteria. A similar set of core genes for Archaea was derived. For more information on core genes, see the Bacterial and Archaeal core genes tool protocols.
  • These metrics are further described in Reference (2) and on the Reference Genome Analysis page.
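The length-based metrics (3), (4) and (5) can be computed from a list of contig lengths and a list of scaffold spans alone. The following is a minimal sketch, not the consortium's own tooling: the function names `n50` and `length_based_checks` are hypothetical, the N50 definition used is the common one (the length at which the longest sequences, taken in descending order, first account for half of the total bases), and no minimum contig size filter is applied. Metrics (1), (2) and (6) need read alignments or gene calls and are not covered.

```python
from typing import Dict, Sequence, Tuple


def n50(lengths: Sequence[int]) -> int:
    """Largest length L such that sequences of length >= L hold at least
    half of the total bases. Returns 0 for an empty input."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def length_based_checks(
    contig_lengths: Sequence[int], scaffold_spans: Sequence[int]
) -> Dict[str, Tuple[float, bool]]:
    """Evaluate metrics (3), (4) and (5); each entry is (value, passed)."""
    contig_n50 = n50(contig_lengths)
    scaffold_n50 = n50(scaffold_spans)
    mean_contig = (
        sum(contig_lengths) / len(contig_lengths) if contig_lengths else 0.0
    )
    return {
        "contig_n50": (contig_n50, contig_n50 > 5_000),        # metric (3)
        "scaffold_n50": (scaffold_n50, scaffold_n50 > 20_000),  # metric (4)
        "mean_contig_length": (mean_contig, mean_contig > 5_000),  # metric (5)
    }


# Illustrative lengths in bp, invented for this example.
report = length_based_checks(
    contig_lengths=[8_000, 7_000, 6_000, 6_000, 3_000],
    scaffold_spans=[30_000, 25_000, 5_000],
)
for name, (value, passed) in report.items():
    print(f"{name}: {value:,.0f} bp -> {'pass' if passed else 'fail'}")
```

Reporting each value next to its pass/fail flag makes it easy to see how far an assembly is from a threshold, which is useful when deciding whether further improvement is worthwhile.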

Improved High-Quality Draft: A sequence grade characterized by automated and/or manual work involving manipulation of existing shotgun data or addition of automated directed reads. The assembly should contain no discernible misassemblies and should have undergone some form of gap resolution to reduce the number of contigs and supercontigs (or scaffolds). Undetectable misassemblies are still possible, particularly in repetitive regions. Low-quality regions and potential base errors may also be present. HMP genomes with this designation have a contig N50 of at least 50 kb and are free of N base calls. This standard is normally adequate for comparison with other genomes.

Annotation-Directed Improvement: Finishing work is targeted to clearly defined areas identified by an automated annotation pipeline. This level may overlap with the previous standards, but the term emphasizes the verification and correction of anomalies within coding regions, such as frameshifts and stop codons. It will most often be used in cases involving complex genomes where improvement beyond this category is too costly. Gene models (gene calls, including intron/exon determination for eukaryotes) and annotation of the genomic content should fully support the biology of the organism and the scientific questions being investigated. A coordinate key is included with the submission describing boundaries of Finished vs. Draft sequence. Exceptions to this gene-specific genome standard should be noted in the submission. Repeat regions are not resolved at this level, and may contain errors. Assemblies subjected to Annotation-Directed Improvement will exhibit a minimum 50 kb contig N50 and will carry a representational full-length or attempted full-length 16S rRNA copy. These genomes will be subjected to a second automated annotation after improvement is complete to confirm improvement in quality of gene content. This standard is useful for gene comparisons, alternative splicing analysis and pathway reconstruction.

Noncontiguous Finished: A high-quality assembly that has been subjected to automated and manual improvement, and where closure approaches have been successful for almost all gaps, misassemblies, and low-quality regions. Attempts have been made to resolve all gaps and sequence uncertainties, and only those recalcitrant to resolution remain. Full annotation of any areas not meeting the Finished standard is required. Requirements for HMP Noncontiguous Finished assemblies include, but are not limited to, the following: a maximum of 3 scaffolds/Mb, coverage of 97% of the captured genome, identification and processing of bacterial plasmids, and one complete 16S rRNA gene. Base quality is expected to be at Finished quality (as described below) unless otherwise noted, including removal of low-confidence data at contig ends and resolution of ambiguous bases and potentially misassembled regions.
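The two numeric requirements for this level reduce to simple arithmetic. The sketch below is illustrative only and rests on assumptions not stated above: scaffold density is measured against the total scaffold length, the 97% figure is treated as a minimum, and the covered fraction is supplied by the caller because the text does not define how the captured genome is measured. The function names are hypothetical.

```python
from typing import Sequence


def scaffolds_per_mb(scaffold_lengths: Sequence[int]) -> float:
    """Number of scaffolds per megabase of total scaffold length."""
    total_bases = sum(scaffold_lengths)
    if total_bases == 0:
        raise ValueError("assembly contains no bases")
    return len(scaffold_lengths) / (total_bases / 1_000_000)


def meets_numeric_requirements(
    scaffold_lengths: Sequence[int], covered_fraction: float
) -> bool:
    """Check the scaffold-density and coverage figures only.

    Plasmid processing and the complete 16S rRNA gene must be verified
    separately; they cannot be derived from lengths.
    """
    return scaffolds_per_mb(scaffold_lengths) <= 3 and covered_fraction >= 0.97


# Illustrative assembly: four scaffolds totalling 4 Mb.
example = [1_000_000, 1_500_000, 500_000, 1_000_000]
print(scaffolds_per_mb(example))
print(meets_numeric_requirements(example, covered_fraction=0.98))
```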

Finished: Genome sequences with fewer than 1 error per 100,000 base pairs and where each replicon is assembled into a single contiguous sequence, with a minimal number of possible exceptions documented in the submission record. All sequences are complete and have been reviewed and edited, all known misassemblies have been resolved, and repetitive sequences have been ordered and correctly assembled. Remaining exceptions to highly accurate sequence within the euchromatin are noted in the submission. The Finished product is appropriate for all types of detailed analyses and acts as a high-quality reference genome for comparative purposes. Some microbial genome sequences for which multiple platforms have been used on the same genome have exceeded this standard, and it is believed that no bases are incorrect except for natural, low-level biological variation.
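To put the Finished error bound in concrete terms, the sketch below converts it into an error count for a given genome length and into a Phred-scaled quality value. The Phred scale (Q = -10 log10 of the error probability) is a widely used convention that is not part of the definition above; it is introduced here only for illustration, and the function names are hypothetical.

```python
import math

# Upper bound from the Finished definition: 1 error per 100,000 bp.
MAX_ERROR_RATE = 1 / 100_000


def max_expected_errors(genome_length_bp: int) -> float:
    """Error count that a genome of this length must stay below."""
    return genome_length_bp * MAX_ERROR_RATE


def phred_quality(error_rate: float) -> float:
    """Phred-scaled quality for a per-base error probability."""
    return -10 * math.log10(error_rate)


print(max_expected_errors(5_000_000))
print(round(phred_quality(MAX_ERROR_RATE)))
```

For a 5 Mb genome the bound works out to fewer than 50 errors in total, which corresponds to a consensus quality of about Q50.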
