Microbiome Analyses

The HMP is performing 16S rRNA and metagenomic sequencing of samples from a healthy human population to address questions such as whether there is a "core" microbiome at individual body sites and whether variation in the microbiome can be systematically studied. More information can be found below and on the NIH Common Fund Site.


About HMP Metagenomic Sequencing & Analysis

Within the human body, it is estimated that there are 10x as many microbial cells as human cells. Our microbial partners carry out a number of metabolic reactions that are not encoded in the human genome and are necessary for human health. Therefore when we talk about the "human genome" we should think of it as an amalgam of human genes and those of our microbes.

The majority of microbial species present in the human body have never been isolated, cultured or sequenced, typically because the necessary growth conditions cannot be reproduced in the lab. A large amount of organismal and functional novelty therefore remains to be discovered. Metagenomics, or the study of microbial communities, takes advantage of advances in sequencing technology and analysis methods to comprehensively examine microbial communities directly from their natural habitats, potentially revealing novel content. The HMP is using 16S rRNA and metagenomic sequencing to characterize the complexity of the human microbiome at 15 or 18 body sites. This supplements the sequencing and analysis of reference genomes isolated from human body sites, generating unprecedented amounts of data about the complexity of the human microbiome, and providing a baseline for further research into the impacts of the microbiome on human health and disease.

16S and wgs sequencing occurred in a four-phase process, consisting of a mock data pilot phase, a clinical pilot phase, clinical phase 1 completed in July 2010, and clinical phase 2 currently in progress. For more details on the number of samples, and the amounts of sequence & analysis output generated for each data type, see the tabs to the left, and the HMP Data Tour Guide.

If you're interested in joint analysis of 16S and shotgun metagenomic datasets from the HMP, pairing up data from the same microbiome samples can initially seem tricky. The following tables join 16S dataset "SN" and "PSN" identifiers with metagenomic dataset "SRS" identifiers. The diagram below indicates how these sample IDs are related experimentally.

Sample flow of the clinical phases is described in the following schematic:

  • (1) Samples were collected at two dedicated sampling centers (see Sample Collection for more detail), with each human participant being sampled at 15 or 18 body sites over one to three visits.
  • (2) All subject and sample metadata was submitted to EMMES, our partner in metadata storage and management. EMMES assigned each human subject a de-identified RSID (random subject identifier).
  • (3) Every individual sample was assigned a PSN id, an EMMES-generated Primary Sample Number.
  • (4) DNA was extracted from the primary samples using methods described in the Manual of Procedures, and assigned a NAP id (Nucleic Acid Preparation) generated by the sequencing centers. Extractions were divided into aliquots as necessary. Each aliquot was assigned an SN id, an EMMES-generated Sample Number. For the clinical pilot phase, every primary sample was sent to two sequencing centers to assess reproducibility of technical replicates across multiple institutions, so there is a one-to-many relationship between PSN and SN ids. In the later clinical phases, a single PSN sample generated a single SN sample, which was sent to a single sequencing center, giving a one-to-one relationship between PSN and SN.
  • (5) Sequencing centers performed sequencing on each SN sample, sequencing 16S variable regions 3-5 (V35) from every sample within the study, and variable regions 1-3 (V13) and variable regions 6-9 (V69) from subsets of samples. If the SN samples did not provide enough material for sequencing, the centers were sent the original PSN samples for re-extraction and sequencing.
  • (6) All sequence data was submitted by the sequencing centers directly to NCBI's Sequence Read Archive (SRA) data repository. Centers submitted data using the RSID subject ids and the NAP sample ids. Data was then organized at NCBI into their standardized levels of experiment accession (SRX ids), run id (SRR ids) and sample id (SRS ids).
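The identifier relationships above can be sketched in a few lines of Python. This is a minimal illustration, not the DACC's own tooling: the identifier values are made up, and the assumption that each SRS accession can be keyed to a single PSN is ours. The real pairings come from the mapping tables.

```python
from collections import defaultdict

# Hypothetical rows; real identifiers come from the DACC mapping tables.
# Each 16S row links a primary sample (PSN) to one aliquot (SN).
sixteen_s_rows = [
    {"psn": "PSN_0001", "sn": "SN_1001"},
    {"psn": "PSN_0001", "sn": "SN_1002"},  # pilot phase: one PSN, two SNs
    {"psn": "PSN_0002", "sn": "SN_1003"},
]
# Each metagenomic row links an SRS accession to a primary sample
# (assumed layout for illustration).
wgs_rows = [
    {"srs": "SRS_A", "psn": "PSN_0001"},
    {"srs": "SRS_B", "psn": "PSN_0003"},  # no matching 16S data
]


def pair_datasets(sixteen_s, wgs):
    """Return {srs: sorted SN ids} for primary samples present in both sets."""
    sn_by_psn = defaultdict(list)
    for row in sixteen_s:
        sn_by_psn[row["psn"]].append(row["sn"])
    return {
        row["srs"]: sorted(sn_by_psn[row["psn"]])
        for row in wgs
        if row["psn"] in sn_by_psn
    }


paired = pair_datasets(sixteen_s_rows, wgs_rows)
print(paired)
```

Grouping SN ids by PSN first keeps the one-to-many pilot-phase case and the one-to-one case in the same code path.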

Data, tools, and protocols used by the HMP consortium are available from the DACC using the links above.

Data is also available from NCBI through the HMP Roadmap Project BioProject page.

Metagenomic sequencing and analysis efforts have been coordinated through a large number of working groups covering all aspects from sequence generation & processing, through data release. If you have questions or comments, please contact us through the DACC feedback form.

Sample Collection

16S rRNA and metagenomic wgs sequencing is being performed on samples collected from 242 healthy adult men and women between the ages of 18 and 40, recruited at Baylor College of Medicine in Houston, TX and Washington University in St. Louis, MO. See a full list of study inclusion/exclusion criteria.

Samples were collected in a non-invasive manner from five major body sites: oral cavity, nasal cavity, skin, gastrointestinal tract and urogenital tract; with a total of 15 or 18 specific body sites.

  • Oral cavity
      • Attached keratinized gingiva (gums)
      • Buccal mucosa (cheek)
      • Hard palate
      • Palatine tonsils
      • Saliva
      • Subgingival plaque
      • Supragingival plaque
      • Throat
      • Tongue dorsum
  • Nasal cavity
      • Anterior nares (nostrils)
  • Skin
      • Left and right antecubital fossa (inner elbow)
      • Left and right retroauricular crease (behind the ear)
  • Gastrointestinal tract
      • Stool
  • Urogenital tract
      • Mid vagina
      • Posterior fornix
      • Vaginal introitus
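As a quick check on the "15 or 18" figure, the list above can be written as a lookup table. This is a sketch: the `BODY_SITES` name and the `count_sites` helper are ours, and it assumes the 15-site count applies to subjects for whom the three urogenital sites are not sampled.

```python
# Major body site -> specific sampling sites, transcribed from the list above.
BODY_SITES = {
    "Oral cavity": [
        "Attached keratinized gingiva", "Buccal mucosa", "Hard palate",
        "Palatine tonsils", "Saliva", "Subgingival plaque",
        "Supragingival plaque", "Throat", "Tongue dorsum",
    ],
    "Nasal cavity": ["Anterior nares"],
    "Skin": [
        "Left antecubital fossa", "Right antecubital fossa",
        "Left retroauricular crease", "Right retroauricular crease",
    ],
    "Gastrointestinal tract": ["Stool"],
    "Urogenital tract": ["Mid vagina", "Posterior fornix", "Vaginal introitus"],
}


def count_sites(include_urogenital=True):
    """Count specific sites, optionally leaving out the urogenital tract."""
    return sum(
        len(sites)
        for region, sites in BODY_SITES.items()
        if include_urogenital or region != "Urogenital tract"
    )


print(count_sites(True), count_sites(False))
```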

In addition, subjects donated blood to be used to examine the relationship of host genotype to an individual's microbiota, and serum for possible future studies, such as determination of immune response to organisms identified in the individual's microbiome.

See the HMP Manual of Procedures - Core Microbiome Sampling Protocol for more information on sampling procedures used for collecting metagenomic samples under the HMP.

All enrolled subjects were sampled at one visit, with a subset of subjects sampled at up to three visits. Patient screening began in November 2008 and the final sample was taken in October 2010, for a total of over 11,000 primary specimens from 300 adults.

Study Participant information

These documents represent the consent forms being used at Washington University and Baylor College of Medicine, along with supplemental information being provided to potential donors.

Study participants who have questions or concerns about any aspect of the project should contact:

Baylor College of Medicine

James Versalovic, MD, PhD

Baylor College of Medicine

Texas Children's Hospital

6621 Fannin Street, MC 1-2261

Houston, TX 77030

phone: (832) 824-2213

email: jamesv@bcm.edu

Wendy Keitel, MD

Department of Molecular Virology and Microbiology

Baylor College of Medicine 280

One Baylor Plaza

Houston, TX 77030

phone: (713) 798-5250

email: wkeitel@bcm.edu

Washington University

Mark Watson, MD, PhD

Washington University School of Medicine

Campus Box 8118

660 S. Euclid Avenue

St. Louis, MO 63110

phone: (314) 454-7919

email: watsonm@wustl.edu

Michael Dunne, PhD

Washington University School of Medicine

258A Barnes-Jewish Hospital Service Building

St. Louis, MO 63110

phone: (314) 362-1547

email: dunne@wustl.edu

Validation using mock communities

At the start of the HMP, there was no standardized protocol for ensuring high throughput consistency of 16S amplification & sequencing protocols. Therefore the HMP evaluated a number of protocols using a synthetic mock community of 21 known organisms, before adopting the HMP 16S 454 Protocol. These organisms include a variety of genera commonly found on or within the human body. Genomic DNA from each organism was mixed, based on qPCR of 16S rRNA measurements, to generate two mock mixtures:

  • Even mock community
      • 100,000 16S copies per organism per aliquot
  • Staggered mock community
      • 1,000 to 1,000,000 16S copies per organism per aliquot
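To see what these mixtures imply for a sequencing run, the expected share of 16S copies per organism can be computed directly from the copy numbers. The sketch below uses placeholder organism names, and the staggered values are hypothetical points within the stated range rather than the actual composition (see the DACC summary table for that).

```python
def expected_fractions(copies):
    """Convert per-organism 16S copy numbers into expected fractions."""
    total = sum(copies.values())
    return {name: n / total for name, n in copies.items()}


# Even community: 21 organisms at 100,000 copies each.
even = {f"organism_{i:02d}": 100_000 for i in range(1, 22)}

# Staggered community: hypothetical values spanning 1,000 to 1,000,000.
staggered = {
    "organism_01": 1_000,
    "organism_02": 10_000,
    "organism_03": 100_000,
    "organism_04": 1_000_000,
}

even_fractions = expected_fractions(even)
staggered_fractions = expected_fractions(staggered)
print(round(even_fractions["organism_01"], 4))
print({name: round(f, 4) for name, f in staggered_fractions.items()})
```

Comparing observed read fractions against these expectations is one simple way to judge how faithfully a protocol recovers a known mixture.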

Mock communities were also used to evaluate consistency between sequencing centers of both 16S rRNA and metagenomic wgs sequencing.

Mock communities are available to the community through the BEI Resource.

Mock sequence data can be found at NCBI. The DACC hosts a summary table of community composition. See HMMC for more information.

16S Sequencing & Analysis

16S rRNA sequencing has been used to characterize the complexity of microbial communities at each body site, and to determine whether there is a core microbiome.

The 16S rRNA sequence contains both highly conserved and variable regions. These variable regions, nine in number (V1 through V9), are routinely used to classify organisms according to phylogeny, making 16S rRNA sequencing particularly useful in metagenomics to help identify taxonomic groups present in a sample.


Fundamental to the use of 16S rRNA sequencing and analysis by the HMP was the development of a consistent 16S sequencing protocol used across all groups contributing to this project.

  • Phase I Sequencing & Analysis
      • 5,771 samples were chosen from 242 adults, covering all 18 body sites
      • sequencing was performed using the Roche-454 FLX Titanium platform
      • V3-V5 variable region window (V35) was sequenced for all samples
      • V1-V3 variable region window (V13) was sequenced for a subset of ~3,000 samples to provide a complementary taxonomic view
      • this resulted in a total of >10,000 sequence preparations, sequenced over 7,518 SRA runs
      • metagenomic wgs sequencing was performed for a subset of 560 of these samples
  • Phase 2 efforts are ongoing to sequence and analyze remaining samples

Phase 1 data and protocols are available from the DACC using the links above. Raw reads are also available from NCBI.

DACC members work closely with the Genomic Standards Consortium to ensure that HMP 16S rRNA sample and sequence preparation data is MIMARKS compliant, and are currently working with NCBI to make that metadata available to the public.


Metagenomic WGS Sequencing and Analysis

Metagenomic whole genome shotgun (wgs) sequencing provides insights into the functions and pathways present in the human microbiome. HMP data will serve to generate a reference framework for those looking into associations between changes in the human microbiome and disease states.

  • Phase I Sequencing & Analysis
      • 764 samples were chosen from 103 adults, covering 16 body sites
      • sequencing was performed using the Illumina GAIIx platform with 101bp paired-end reads
      • all samples were screened for human contamination using NCBI's BMTagger tool, with ~49% of reads targeted for removal as human. These reads are available by authorized access only, through NCBI dbGAP
      • 749 samples provided sufficient sequence for assembly using an optimized SOAPdenovo protocol
      • Of these, 690 samples passed quality control assessments, including identification of outliers by mean contig and ORF density, human hits, rRNA hits and size. These samples were used for downstream wgs analysis, including gene calling and annotation, community profiling by reference genome mapping, and metabolic reconstruction.
      • 16S rRNA sequencing was performed for a subset of 560 of these samples
      • In addition, a subset of 12 stool samples was simultaneously sequenced using the 454 FLX Titanium platform
      • All 12 sequence sets were assembled using Newbler, and are referred to on the DACC as Hybrid Assemblies
  • Phase 2 efforts are ongoing to sequence and analyze remaining samples
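The sample counts above imply the following retention at each stage. This is plain arithmetic on the published figures; the stage labels and the `retention` helper are ours.

```python
# Sample counts at each stage, taken from the Phase I summary above.
stages = [
    ("chosen for sequencing", 764),
    ("sufficient sequence for assembly", 749),
    ("passed quality control", 690),
]


def retention(stage_counts):
    """Return (label, count, percent of the first stage) for each stage."""
    start = stage_counts[0][1]
    return [
        (label, count, round(100 * count / start, 1))
        for label, count in stage_counts
    ]


for label, count, pct in retention(stages):
    print(f"{label}: {count} samples ({pct}% of chosen)")
```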

Phase 1 data and protocols are available from the DACC using the links above. Raw reads are also available from NCBI. Metadata is available through the HMP Project Catalog, see left tab.


HMP Project Catalog


The HMP Project Catalog provides metadata for all HMP metagenomic samples, with links to read data available through the NCBI Sequence Read Archive (SRA).

Public metadata fields include sample name, body site, de-identified subject id, and visit number. All other metadata is confidential and accessible only to authorized users through NCBI dbGAP.
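For readers scripting against an export of the catalog, the four public fields map naturally onto a small record type. The sketch below is an assumption about how one might model them: the class, field names, and example values are hypothetical and are not the catalog's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicSampleRecord:
    """The four public metadata fields named above (field names are ours)."""
    sample_name: str
    body_site: str
    subject_id: str  # de-identified subject id
    visit_number: int


def visits_by_subject(records, body_site):
    """Map each subject to the sorted visit numbers sampled at one body site."""
    visits = {}
    for rec in records:
        if rec.body_site == body_site:
            visits.setdefault(rec.subject_id, []).append(rec.visit_number)
    return {subject: sorted(v) for subject, v in visits.items()}


# Hypothetical records for illustration only.
records = [
    PublicSampleRecord("sample_1", "Stool", "subject_A", 2),
    PublicSampleRecord("sample_2", "Stool", "subject_A", 1),
    PublicSampleRecord("sample_3", "Saliva", "subject_A", 1),
    PublicSampleRecord("sample_4", "Stool", "subject_B", 1),
]
stool_visits = visits_by_subject(records, "Stool")
print(stool_visits)
```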

The HMP Catalog is built upon the Genomes OnLine (GOLD) database structure and the IMG-GOLD system for capturing genome project information.

DACC members work closely with the Genomic Standards Consortium to ensure that HMP metagenomic wgs data is MIMS compliant.

Metagenomic Analysis Resources

In addition to tools and protocols available throughout the DACC, as well as published analyses, members of the HMP consortium have developed the following resources highlighting HMP metagenomic data:


The img/hmp m system, developed by the DOE Joint Genome Institute, provides support for analyzing HMP reference genomes & metagenomes in the context of all publicly available genomes in IMG. Data can be navigated along the three dimensions of gene, genome, or function.

  • img/hmp provides analysis resources including:
      • Comparative genomics
      • Functional analysis
      • Metadata analysis



Sitepainter, developed in the Knight Lab at University of Colorado Boulder, allows for visualization of HMP 16S rRNA and metagenomic wgs data by color gradients of taxonomic or functional pathway data over individual body sites.