Microbial Reference Genomes

The HMP plans to sequence, or collect from publicly available sources, a total of 3000 reference genomes isolated from human body sites. The information gained from the reference genomes will aid in taxonomic assignment and functional annotation of 16S rRNA and metagenomic wgs sequence, respectively, from microbiome samples. More information can be found below and on the NIH Common Fund Site.


About Reference Genomes

The HMP plans to sequence 3000 genomes from both cultured and uncultured bacteria, plus several viral and small eukaryotic microbes isolated from human body sites. This, in conjunction with reference genomes sequenced by HMP Demonstration Projects and other members of the International Human Microbiome Consortium (IHMC), will supplement the available selection of non-HMP funded human-associated reference genomes to provide a comprehensive pool of genome sequences to aid in the analysis of human metagenomic data.



Figure 1. Taxonomic domain of currently ongoing or completed HMP reference genomes. Sequencing efforts are largely bacterial-focused; however, we anticipate sequencing ~1000 viruses and small numbers of representative archaeal and eukaryotic references.



Figure 2. Project affiliation of currently ongoing or completed human-associated reference genomes. Human microbiome efforts, including the HMP and HMP Demonstration Projects and the IHMC, are contributing considerable numbers of novel reference genome sequences to the pool of human-associated genome data available to aid in the analysis of metagenomic samples.

Reference genomes have been selected from the same major body sites as HMP metagenomic sampling efforts, as well as additional body sites not targeted by HMP metagenomic sampling, to provide the most thorough representation.


Figure 3. Human body site of isolation of currently ongoing or completed reference genomes, from both HMP and non-HMP projects.

Details of all reference genomes that are planned, underway or completed can be found in the HMP Project Catalog (see the left tab).

Data and protocols are available from DACC using the links above.

The DACC provides individual and bulk downloads of completed reference genomes on the HMRGD data page. A non-redundant database of all human-associated reference genomes (as of 9/2010) used for metagenomic shotgun read mapping is available on the HMREFG data page.

Individual reference genome data is also available at NCBI.

Reference genome efforts are coordinated through the HMP Strains Working Group. This working group includes representatives of the HMP Sequencing Centers, DACC, Technology Development grantees, and a group of body site experts providing feedback and recommendations on strain selection. Current reference genomes have been collected from public repositories, through collaborations with sequencing centers, and via community feedback.
If you have any questions, comments, or would like to recommend additional reference genomes for inclusion in the HMP catalog, please contact us through the DACC feedback form.

Why Sequence Reference Genomes?

Reference genomes serve as guideposts to aid metagenomic analysis.

Within the human body, it is estimated that there are 10x as many microbial cells as human cells. Our microbial partners carry out a number of metabolic reactions that are not encoded in the human genome and are necessary for human health. Therefore when we talk about the "human genome" we should think of it as an amalgam of human genes and those of our microbes.

The majority of microbial species present in the human body have never been isolated, cultured or sequenced, typically because the necessary growth conditions cannot be reproduced in the lab. A great deal of organismal and functional novelty therefore remains to be discovered. Metagenomics, or the study of microbial communities, will, in theory, reveal this novel content. In order to assign metagenomic sequence to taxonomic and functional groupings, and to differentiate the novel from the previously described, it is necessary to have a large pool of described genomes (aka reference genomes) from the same environments. Reference genomes can then be used to map taxonomic and functional information onto metagenomic sequence (see the 16S and metagenomic WGS protocols).


See the left tabs for more information about approaches specific to targeting uncultured or underrepresented reference genomes.

More about Finishing Levels

The majority of the Microbial Reference Genomes will be sequenced only to a high-quality draft stage (1). High-quality draft sequences do not include every base of the genome; rather, they are assemblies of several large contiguous pieces of sequence (contigs) with subsequent gaps in sequence knowledge. Nonetheless, they provide enough information for a general assessment of gene content.
About 15% of the Microbial Reference Genomes will be taken to improved levels of finishing up to and including a finished (often referred to as Gold Standard) state. A set of community-defined categories of finishing standards has recently been established (1). The HMP reference genome consortium has adopted these standards, adding details specific to their application to HMP Reference Genomes.
The goal finishing standard for every Reference Genome project can be found in the HMP Project Catalog.

Finishing Standards - Provisional HMP Consortium Definitions

The genome standards defined here are aggregated from the following references:

For more information regarding these standards, please review the complete texts.

Standard Draft: This is the minimum standard for a submission to the public databases. It may contain unfiltered data from any number of different sequencing platforms, which are assembled into contigs. Sequence may have poor quality and can be relatively incomplete. It may not always be possible to remove contaminating sequence data. Standard Draft is the least expensive to produce and still possesses useful information.

High-quality draft: This is a draft assembly with little or no manual review of the assembly. Overall coverage should represent at least 90% of the genome or target region. Efforts should be made to exclude contaminating sequences. Sequence errors and misassemblies are possible, with no implied order and orientation to contigs. This is appropriate for general assessment of gene content. High-Quality Draft genome standards, with their rationales, are as follows:

  • (1) >90% of the genome included in contigs (≥ 500bp) so as to ensure completeness for identification of source species for metagenomic sequences. Genome size is estimated as the sum of contigs for fragment assemblies without read pairs, or sum of scaffold spans for assemblies with read pairs
  • (2) >90% of bases at greater than 5x read coverage to provide assurance of high base quality in the consensus sequence (this allows base quality to be assessed independent of sequencing platform as use of quality values may be inconsistent between platforms)
  • (3) >5 kb contig N50 length to ensure long enough contiguous sequences so that most genes are intact
  • (4) >20 kb scaffold N50 length to ensure long enough scaffolds to capture large operons
  • (5) Average contig length > 5kb to provide uniformity throughout the assembly, i.e., assembly is not a few large contigs and many small ones
  • (6) >90% of "core genes" present in the gene list, to ensure completeness. The core genes comprise single copy genes conserved among all sequenced genomes in the superkingdom Bacteria. A similar set of core genes for Archaea was derived. For more information on core genes, see the Bacterial and Archaeal core genes tool protocols.
  • These metrics are further described in Reference (2) and on the Reference Genome Analysis page.
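
The length-based criteria (3), (4) and (5) above can be illustrated with a minimal Python sketch. It is not the HMP consortium's implementation: the function names, the example lengths, and the choice to apply the 500 bp cutoff from criterion (1) before computing contig statistics are all assumptions made for illustration.

```python
from typing import Dict, List


def n50(lengths: List[int]) -> int:
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


def check_hq_draft(contigs: List[int], scaffolds: List[int]) -> Dict[str, bool]:
    """Check criteria (3), (4) and (5) of the High-Quality Draft standard.

    Assumption: only contigs of at least 500 bp are counted, mirroring
    the cutoff stated in criterion (1).
    """
    kept = [c for c in contigs if c >= 500]
    mean_len = sum(kept) / len(kept) if kept else 0.0
    return {
        "contig_n50_gt_5kb": n50(kept) > 5_000,           # criterion (3)
        "scaffold_n50_gt_20kb": n50(scaffolds) > 20_000,  # criterion (4)
        "mean_contig_gt_5kb": mean_len > 5_000,           # criterion (5)
    }


# Hypothetical assembly; all lengths are in base pairs.
example_contigs = [40_000, 25_000, 12_000, 8_000, 6_000, 900, 300]
example_scaffolds = [70_000, 21_000, 900]
report = check_hq_draft(example_contigs, example_scaffolds)
```

Criteria (1), (2) and (6) need a genome size estimate, per-base read coverage and a gene list respectively, so they are left out of this sketch.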

Improved High-Quality Draft: A sequence grade characterized by automated and/or manual work involving manipulation of existing shotgun data or addition of automated directed reads. This should contain no discernable misassemblies and should have undergone some form of gap resolution to reduce the number of contigs and supercontigs (or scaffolds). Undetectable misassemblies are still possible, particularly in repetitive regions. Low-quality regions and potential base errors may also be present. HMP genomes with this designation exhibit a minimum 50 kb contig N50 and are free of N base calls. This standard is normally adequate for comparison with other genomes.

Annotation-Directed Improvement: Finishing work is targeted to clearly defined areas identified by an automated annotation pipeline. This level may overlap with the previous standards, but the term emphasizes the verification and correction of anomalies within coding regions, such as frameshifts and stop codons. It will most often be used in cases involving complex genomes where improvement beyond this category is too costly. Gene models (gene calls, including intron/exon determination for eukaryotes) and annotation of the genomic content should fully support the biology of the organism and the scientific questions being investigated. A coordinate key is included with the submission describing boundaries of Finished vs. Draft sequence. Exceptions to this gene-specific genome standard should be noted in the submission. Repeat regions are not resolved at this level, and may contain errors. Assemblies subjected to Annotation-Directed Improvement will exhibit a minimum 50 kb contig N50 and will carry a representational full-length or attempted full-length 16S rRNA copy. These genomes will be subjected to a second automated annotation after improvement is complete to confirm improvement in quality of gene content. This standard is useful for gene comparisons, alternative splicing analysis and pathway reconstruction.

Noncontiguous Finished: A high quality assembly that has been subjected to automated and manual improvement, and where closure approaches have been successful for almost all gaps, misassemblies, and low-quality regions. Attempts have been made to resolve all gap and sequence uncertainties, and only those recalcitrant to resolution remain. Full annotation of any areas not meeting Finished standard is required. HMP Non-contiguous finished assemblies include but are not limited to a maximum of 3 scaffolds/Mb, must cover 97% of the captured genome, require identification and processing of bacterial plasmids and contain one complete 16S rRNA gene. Base quality is expected at Finished quality (as described below) unless otherwise noted, including removal of low confidence data at contig ends, and resolution of ambiguous bases and potentially misassembled regions.

Finished: Genome sequences with less than 1 error per 100,000 base pairs and where each replicon is assembled into a single contiguous sequence with a minimal number of possible exceptions documented in the submission record. All sequences are complete and have been reviewed and edited, all known misassemblies have been resolved, and repetitive sequences have been ordered and correctly assembled. Remaining exceptions to highly accurate sequence within the euchromatin are commented in the submission. The Finished product is appropriate for all types of detailed analyses and acts as a high-quality reference genome for comparative purposes. Some microbial genome sequences where multiple platforms have been used for the same genome have exceeded this standard, and it is believed that no bases are incorrect except for natural, low-level biological variation.
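
As a worked illustration of the Finished error threshold, the short Python sketch below converts an error rate into a Phred-scaled quality score using the standard relation Q = -10 log10(P). The function name is hypothetical, and the standard itself is stated only as an error rate; the Phred conversion is added here for readers used to thinking in quality scores.

```python
import math


def phred_quality(errors: float, bases: float) -> float:
    """Convert an error rate (errors per number of bases) to a Phred-scaled score."""
    if errors <= 0 or bases <= 0:
        raise ValueError("errors and bases must be positive")
    return -10.0 * math.log10(errors / bases)


# 1 error per 100,000 bp corresponds to roughly Q50,
# so a Finished consensus is expected to be better than about Q50.
finished_q = phred_quality(1, 100_000)
```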

HMP Project Catalog

The HMP Project Catalog provides metadata for all human-associated reference genomes. This ranges from projects that are planned but for which we have yet to acquire a strain for sequencing (project status "targeted") through those that have reached completion (project status "complete").

The HMP Catalog is built upon the Genomes OnLine (GOLD) database structure and the IMG-GOLD system for capturing genome project information.

Users can:

  • View or filter on datasets containing a) only HMP/IHMC reference genomes, b) non-HMP human-associated reference genomes, or c) all human-associated reference genomes
  • View a full list of HMP Reference Genome statuses
  • Filter on 68 different metadata fields, including body site, taxonomic group, project status, sequencing center, etc.
  • Link to statistics
  • Create & export custom views of the data as an Excel or CSV file

DACC/GOLD members work closely with the Genome Standards Consortium to ensure that HMP reference genomes are MIGS compliant.

Reference Genomes Analysis

The primary analysis resource for reference genomes is the img/hmp system, which supports analyzing HMP reference genomes in the context of all publicly available genomes in IMG. As genomes are released by GenBank, they are routinely incorporated into the Integrated Microbial Genomes (IMG) system. IMG analysis of a genome includes extensive functional annotation, pathway analysis, and integration with other genomes. The data can be navigated along the three dimensions of gene, genome, or function.

img/hmp provides analysis resources including:

  • Comparative genomics
  • Functional analysis
  • Metadata analysis

Published Analyses

Additional analyses were published in 2010 in the first publication describing HMP reference genomes, a catalog of reference genomes from the human microbiome.

Phylogenetic Analysis of 16S rDNA sequences from HMP reference genomes:
Phylogenetic trees were created using 16S rDNA sequences available from the greengenes download directory. An all-HMP reference genome tree was created using ~1800 16S rDNA sequences representing unique species. Bacterial HMP Reference Genomes are highlighted in blue, overlaid upon represented phyla color coded as indicated on each tree. Alignment files are also available.
Download alignment file

Novel gene surveys:
Annotated polypeptides from the initially described set of 178 reference genomes were searched against the bacterial and viral divisions of NCBI's nonredundant (nr) protein database, and compared to a merged database of TIGRfam and Pfam Hidden Markov Models; see the publication's methods and supplemental material for more details. This analysis resulted in a set of 30,867 polypeptides, of which 29,987 (~97%) were unique.
Download entire set of 30,867 novel polypeptides
Download nonredundant set of 29,987 novel polypeptides

Capturing Elusive Organisms from the Microbiome

Currently, over 1500 bacterial genomes are either in progress or completed as part of the HMP reference genome sequencing efforts; however, we are still far from covering the breadth of phylogenetic diversity of the human microbiome. The majority of microbes present in the body have not yet been cultured using standard methods. Recent advances in culture- and single cell-based technologies are making these previously inaccessible microbes accessible.

The HMP Strains Working Group works very closely with technology development grantees using novel culture- and single cell-based techniques, and is currently involved in pilot projects to sequence previously uncultured and rare species found in human microbiome samples.

Identifying the most wanted taxa from the human microbiome


Image courtesy of the NIH Common Fund

Using 16S-based surveys of metagenomic samples, we have compiled a list of as-yet unsequenced members of the microbiome, prioritized by their distance from already sequenced genomes and their frequency among samples. See the complete list and much more information on the HMP Most Wanted Taxa data page.
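
The idea of prioritizing taxa by distance from sequenced genomes and by frequency among samples can be sketched in Python as follows. This is only an illustration: the scoring formula, the field names and the example values are hypothetical and are not the method used to build the HMP Most Wanted Taxa list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Taxon:
    name: str               # hypothetical OTU label
    identity_to_ref: float  # best 16S identity to any sequenced genome, 0..1
    prevalence: float       # fraction of samples containing the taxon, 0..1


def priority(t: Taxon) -> float:
    """Illustrative score: taxa that are distant from sequenced genomes
    and frequent among samples rank highest. The product is an assumption."""
    return (1.0 - t.identity_to_ref) * t.prevalence


def rank_most_wanted(taxa: List[Taxon]) -> List[Taxon]:
    """Return taxa sorted from highest to lowest priority score."""
    return sorted(taxa, key=priority, reverse=True)


# Made-up example values.
taxa = [
    Taxon("otu_a", identity_to_ref=0.99, prevalence=0.80),
    Taxon("otu_b", identity_to_ref=0.85, prevalence=0.40),
    Taxon("otu_c", identity_to_ref=0.90, prevalence=0.05),
]
ranked = [t.name for t in rank_most_wanted(taxa)]
```

In this toy example a common but already well-represented taxon (otu_a) ranks below a less common but more distant one (otu_b); a real prioritization may weight the two factors differently or apply thresholds instead.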

We are very interested in working with community members who have captured organisms from the list. If you have questions or would like to contribute to these efforts, please contact us through the DACC feedback form.

Access to Strains

We are committed to making every microbial reference genome sequenced as part of the HMP available in a public repository.


NIH's statement on repository deposition and center sequencing

The HMP Consortium has made a commitment to making available bacterial strains whose genomes are sequenced as part of the HMP. Many of these strains are already available in public repositories, such as ATCC and DSMZ, but for those strains not available, the HMP will ensure the deposition of strains into the Biodefense and Emerging Infections Research Resources Repository (BEI). The deposition process includes a number of steps required by both the collaborator making the deposit and BEI. In order to prevent unnecessary delays in sequencing, the HMP sequencing centers and NIH staff have agreed to a specific time point at which sequencing can be initiated. Sequencing can begin when the collaborator depositing the strain has contacted BEI and has submitted the deposit forms to BEI. This corresponds with the deposition status of "BEI Contacted" on the HMP DACC. Even though this allows sequencing to begin before the strain is fully available at BEI, the HMP maintains its commitment to ensuring each sequenced strain becomes publicly available.

BEI Resources is the primary repository for HMP reference genomes


BEI, the Biodefense and Emerging Infections Research Resources Repository, was established by the National Institute of Allergy and Infectious Diseases (NIAID) and is managed under contract by ATCC. It provides reagents, tools and organisms for research on category A, B and C pathogens and emerging infectious disease agents and, in the case of the Human Microbiome Project, HMP reference genomes to the general research community. Materials requested from BEI are provided at no cost, aside from shipping & handling. Visit BEI for more information about their role in this project.

Reference genome organisms already available from a public repository are not required to be deposited in BEI. These repositories include ATCC, DSMZ, CCUG, JCM, LCDC and NCTC.
Refer to the Resource Repository column of the HMP Project Catalog for repository identifiers for individual projects, where available. Many targeted projects do not yet have repository information because we have not yet identified a source for the project.

If repository information is not yet available for an active or completed project that you are interested in, or you have materials that you would be willing to provide for sequencing, please contact us through the DACC feedback form.

Guidelines for Inclusion of Strains in the Microbial Reference Genome Collection & Determination of Finishing Level

The HMP plans to sequence to high-quality draft level, or collect from publicly available sources, a total of 1000 reference genomes isolated from human body sites. Approximately 15% of these will be taken to improved levels of finishing. Here we explain the biological justification for strain selection & identification of strains to be taken to improved finishing levels.

1. Phylogeny and uniqueness of the species.
It is anticipated that the finishing or improvement of the genomes of species that represent novel lineages will enable broad representation of as many lineages as possible, regardless of other criteria, and will provide improved scaffolding for the metagenomic data that are being produced. These genomes will also provide valuable information to groups beyond those involved in metagenomics studies.

2. Established clinical significance.
From the initial work within HMP body site-specific working groups, as well as from external sources and literature on individual strains, we have knowledge regarding relevance to health or disease states. We believe that any strain that has an established clinical significance to some health or disease condition should be included in the subset proposed to receive some level of improvement.

3. Abundance (dominance) in a body site.
Similarly, some strains have accompanying information on abundance and relative abundance in the various body sites. We believe that any strains that have established information on abundance in a body site should be included in the subset proposed to receive some level of improvement. Additional reasoning for these isolates includes:
(a) more predominant organisms will contribute the largest number of shotgun reads and thus should be sequenced to aid in identifying these reads;
(b) more prevalent organisms will most likely have a bigger impact on metabolic capabilities of the community and thus one would want to know their metabolic pathways. This can only be obtained by complete genome sequences or finished genomes.

4. Identical species found in different body sites.
Isolates of the same species from different body sites form an interesting data set, because their metabolic capabilities may differ depending on the body site where they are found. As an example, the Microbial Reference Genome Project Catalog currently includes isolates of Gardnerella vaginalis collected from the vagina as well as the skin.

5. Opportunity to explore pan-genomes.
Isolates whose genomes have already been closed by sequencing efforts outside of the HMP may come from other environmental niches, and additional closed isolates provide more information on the associated pan-genomes. For example, comparing the genome of E. coli O157 to E. coli K12 as the finished reference genome revealed an extra megabase of DNA.

6. Poor quality draft assembly that needs some improvement.
In situations where a genome did not assemble well, we may propose some level of manual improvement to yield a better assembly.

7. Other.
In situations where there are valid criteria other than the justifications listed above.