Data Model

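This material provides a quick tour of much of the data available from the Human Microbiome Project, but it is not an exhaustive inventory of all data sets and analysis products. HMP1 generated large amounts of genomic and metagenomic sequence data. There are two primary portals for accessing data: NCBI, which holds most of the raw sequence data, and the DACC Data Browser, which hosts value-added datasets. Both are described under "Accessing sequence data".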

Framework of Sequence data: Cohort type and Data type

One way of organizing much (though not all) of the metagenomic sequence data generated under the project is to split it by cohort type and data type.

There are two primary cohort types:

  1. Center "Healthy Cohort": This is a single cohort of 300 healthy individuals, each sampled at 5 major body sites (oral, airways, skin, gut, vagina) at up to three timepoints. Each body site consisted of a number of body subsites, for a total of 15 to 18 samples per individual per timepoint. See Microbiome Analysis for more information.
  2. Demonstration Project "Disease Cohorts": These 15 projects each have one or more cohorts aimed at studying specific health conditions. Each project developed sampling, processing, and 16S or whole metagenome shotgun sequencing approaches according to their condition of interest. These cohorts include both controls and affected individuals. See demonstration projects for a brief description of each project.
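
As a rough sense of scale, the Healthy Cohort sampling design above can be multiplied out. The sketch below uses only the figures quoted in the list (300 subjects, 15 to 18 samples per visit, up to three timepoints); the function name is illustrative, and the result is an upper bound rather than a real sample count, since subjects were sampled at up to three timepoints and not necessarily at all three.

```python
def sample_bounds(subjects=300, samples_per_visit=(15, 18), max_timepoints=3):
    """Return (per-timepoint range, upper-bound range) of sample counts.

    All defaults come from the cohort description; the upper bound assumes
    every subject was sampled at every timepoint, which is an idealization.
    """
    low, high = samples_per_visit
    per_timepoint = (subjects * low, subjects * high)
    upper_bound = (per_timepoint[0] * max_timepoints,
                   per_timepoint[1] * max_timepoints)
    return per_timepoint, upper_bound


per_timepoint, upper_bound = sample_bounds()
print(per_timepoint)  # (4500, 5400)
print(upper_bound)    # (13500, 16200)
```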

There are three primary data types:

  1. Reference microbial genomes: Most of these are not derived from specific cohorts
  2. Whole metagenome shotgun (mWGS) sequence
  3. 16S metagenomic sequence

The resulting division can be roughly represented by the following table:

Data type | Center "Healthy Cohort" | Demonstration Project "Disease Cohorts"
----------+-------------------------+----------------------------------------
Reference microbial genomes | >2000 strains (NCBI BioProject 28331) | Hundreds of strains (NCBI BioProject 46305)
mWGS metagenomic sequence | Subset of the 300 subjects, multiple timepoints, 15+ body sites (NCBI BioProject 43017) | 5 projects, each with unique sampling sites, conditions, etc. (NCBI BioProject 46305)
16S metagenomic sequence | 300 subjects, multiple timepoints, 15+ body sites (NCBI BioProject 48489) | 14 projects, each with unique sampling sites, conditions, etc. 4 projects contain both 16S and mWGS components (NCBI BioProject 46305)
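
The table above can also be written as a small lookup keyed by data type and cohort type. This is only a sketch: the BioProject numbers are copied from the table, while the key names ("mwgs", "healthy", and so on) and the helper function are illustrative choices, not identifiers used by the HMP or NCBI.

```python
# BioProject IDs from the table, keyed by (data type, cohort type).
# The key strings are hypothetical labels chosen for this example.
BIOPROJECTS = {
    ("reference_genomes", "healthy"): 28331,
    ("reference_genomes", "disease"): 46305,
    ("mwgs", "healthy"): 43017,
    ("mwgs", "disease"): 46305,
    ("16s", "healthy"): 48489,
    ("16s", "disease"): 46305,
}


def bioproject_for(data_type, cohort):
    """Look up the BioProject ID for a data type and cohort type."""
    return BIOPROJECTS[(data_type.lower(), cohort.lower())]


# Consistency check on the demonstration project counts in the table:
# 5 mWGS projects + 14 16S projects - 4 counted in both = 15 projects.
MWGS_PROJECTS, SIXTEEN_S_PROJECTS, BOTH = 5, 14, 4
total_demo_projects = MWGS_PROJECTS + SIXTEEN_S_PROJECTS - BOTH

print(bioproject_for("mWGS", "healthy"))  # 43017
print(total_demo_projects)                # 15
```

Note that all Demonstration Project data share a single BioProject (46305), so the data type alone does not distinguish them at that level.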

All sequence data is openly available for download. To protect subject privacy, data has been filtered to remove contaminating human sequence.

Framework of Clinical Data

In addition to the mWGS and/or 16S metagenomic sequence data, clinical metadata about the human subjects were also collected. To protect subject privacy, those data are available to qualified researchers only through NCBI's dbGaP portal. "Qualified researchers" are defined as PI-level investigators at legitimate institutions who can describe how they plan to use the data and can follow a series of precautions to safeguard patient privacy. Detailed information on accessing private data is available at the NCBI dbGaP site.

Only the following clinical metadata are available outside of dbGaP, directly embedded in the sequence file metadata and available through the HMP Metadata Catalog:

  1. Unique subject ID
  2. Body site
  3. Sex (male/female)
  4. Visit number

No approval is necessary to access these metadata fields.
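
For downstream scripts, the four public fields listed above might be modelled as a small record type. The sketch below is an assumption about how one could hold them in memory; the class name, field names, types, and the example values are all hypothetical and do not reflect the actual column names or ID formats in the HMP Metadata Catalog.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicSampleMetadata:
    """The four openly available clinical metadata fields (names assumed)."""
    subject_id: str
    body_site: str
    sex: str           # "male" or "female"
    visit_number: int


def group_by_subject(records):
    """Group records by subject ID so repeat visits can be compared."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record.subject_id].append(record)
    return dict(grouped)


# Made-up example records, for illustration only.
records = [
    PublicSampleMetadata("subject-A", "gut", "female", 1),
    PublicSampleMetadata("subject-A", "gut", "female", 2),
    PublicSampleMetadata("subject-B", "skin", "male", 1),
]
by_subject = group_by_subject(records)
print(len(by_subject["subject-A"]))  # 2
```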

Accessing sequence data

Most of the raw sequence data reside at NCBI's Sequence Read Archive (SRA). The most straightforward way to identify all of the SRA data associated with a particular dataset is to enter through the BioProject pages in the table above. Each project-level BioProject page provides links to all associated SRA experiments (SRX). Alternatively, you can begin in the SRA and search for all experiments that are linked to a given BioProject ID. Both processes can be performed manually through NCBI's website or programmatically with NCBI's E-utilities.
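
The SRA-first route can be sketched with the E-utilities `esearch` endpoint. This is a minimal example under stated assumptions, not an official recipe: it assumes the numeric BioProject IDs in the table map to accessions of the form `PRJNA<id>` and that the SRA database accepts a `[BioProject]` field qualifier; the function names are illustrative. The IDs returned are NCBI's internal SRA record UIDs, which would need a further `esummary` or `efetch` call to resolve to SRX accessions.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_sra_search_url(bioproject_id, retmax=20):
    """Build an esearch URL for SRA records linked to a BioProject.

    Assumes the accession is 'PRJNA' + the numeric ID from the table.
    """
    params = {
        "db": "sra",
        "term": f"PRJNA{bioproject_id}[BioProject]",
        "retmax": retmax,
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


def search_sra(bioproject_id, retmax=20):
    """Run the search (requires network access); return (total, UID list)."""
    with urlopen(build_sra_search_url(bioproject_id, retmax)) as response:
        payload = json.load(response)
    result = payload["esearchresult"]
    return int(result["count"]), result["idlist"]


if __name__ == "__main__":
    # Only prints the URL; call search_sra(43017) to query NCBI itself.
    print(build_sra_search_url(43017))
```

When scripting many requests, NCBI's usage guidelines on request rate and API keys apply.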

The DACC Data Browser hosts value-added datasets representing numerous steps along common analysis paths. This is intended to allow researchers to begin analysis pipelines mid-stream, dedicating time to the areas they find most important.

