Tools and Technology
Standard Operating Procedures (SOPs) developed or used for published HMP1 analyses are listed here, along with links to associated tools where available. This list is provided as a legacy resource for those who wish to understand the project better, evaluate protocols or replicate HMP1 analyses. It is not intended to be a comprehensive resource of current microbiome analysis tools, nor do we necessarily endorse these tools over others developed since the end of the project. For more current tools, we recommend the following dockerized pipelines (developed at IGS, not directly affiliated with the HMP):
- HMP QC and HUMAnN2 Container - This container performs QC and functional profiling of metagenomic WGS data, using KneadData and HUMAnN2 respectively.
- HMP QC DADA2 Container - This container performs QC and infers exact amplicon sequence variants (ASVs) from high throughput 16S amplicon data, using the DADA2 R package.
Microbial Reference Genome Analysis
Strain Selection Guidelines
This document outlines the guidelines adopted by the HMP1 Reference Genome Consortium for approval of candidate genera, species or strains to be included in the sequenced reference genome collection. The full collection of sequenced reference genomes, with metadata, is available in the HMP Project Catalog.
BEI Contamination Protocol
This document describes the protocol for communicating detection of contamination in a DNA or microbial sample by an HMP1 Sequencing Center, BEI or HMP Collaborator.
Provisional Reference Genome Assembly Metrics
The HMP1 Reference Genome Consortium defined a set of quality control metrics to be run on every HMP Reference Genome to ensure accuracy, completeness and continuity of draft and improved assemblies. One of these metrics assesses completeness and annotatability of the reference genome by screening for a predefined set of core bacterial and archaeal genes. Protocols for core gene assessment are provided. The core gene evaluation script download contains a Perl script as well as both archaeal and bacterial core gene fasta and cluster files.
- Provisional Reference Genome Assembly Metrics
- Bacterial Core Gene Evaluation SOP
- Archaeal Core Gene Evaluation SOP
- Core Gene Evaluation Script
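As a rough illustration of the core gene screening idea described above, the following Python sketch computes the fraction of an expected core gene set that was detected in an assembly. It is not the Perl script from the download: the function name, the gene names and the input format (plain lists of gene identifiers) are assumptions made for this example.

```python
def core_gene_completeness(expected_core, detected):
    """Return (fraction of expected core genes detected, sorted missing genes)."""
    expected = set(expected_core)
    if not expected:
        raise ValueError("expected_core must not be empty")
    found = expected & set(detected)
    missing = sorted(expected - found)
    return len(found) / len(expected), missing


# Hypothetical gene names, for illustration only.
expected = ["rpoB", "recA", "gyrB", "dnaK"]
detected = ["rpoB", "gyrB", "dnaK", "someOtherGene"]
fraction, missing = core_gene_completeness(expected, detected)
print(f"{fraction:.0%} of core genes detected; missing: {missing}")
```

A low fraction would suggest an incomplete or poorly annotatable assembly; the actual thresholds are defined in the SOPs listed above, not here.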
The initial set of 178 Bacterial Reference Genomes described in the 2010 publication A Catalog of Reference Genomes from the Human Microbiome was annotated using individual sequencing center SOPs. Subsequent Reference Genomes were annotated using a consensus protocol for gene calling and functional annotation, the Common Gene Annotation SOP, developed by the HMP1 Annotation Working Group. Annotated genomic sequences were submitted to NCBI under PRJNA28331. Additional annotation file formats are available at HMRGD.
- Baylor Prokaryotic Annotation SOP
- Broad Institute Prokaryotic Annotation SOP
- JCVI Prokaryotic Annotation SOP
- Washington University Prokaryotic Annotation SOP
- Common Gene Annotation SOP
Creation of a Reference Genome Database
As part of efforts to map HMP mWGS reads to sequenced reference genomes (see dataset HMSCP), the group at Washington University created a provisional reference genome database comprising all archaeal, bacterial, lower eukaryotic and viral organisms available from GenBank prior to 11/2009. Both the protocol for generating the database and the database itself are available here. The database contains 131 archaeal strains over 97 species, 326 lower eukaryotes over 326 species, 3683 viral strains over 1420 species, and 1751 bacterial strains over 1253 species. Highly redundant reference genomes not sequenced by the HMP were removed from the bacterial component of the database.
HMP Most Wanted Genomes
Using 16S-based metagenomic surveys of the HMP healthy human cohort, data from each of 18 different body sites were examined. Organisms to be targeted using culture and single cell-based approaches were then prioritized based on phylogenetic distance from previously sequenced strains and frequency among samples. The resulting Most Wanted Genomes list was a resource for community members interested in isolating and sequencing novel and previously unsequenced organisms found in association with humans.
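The prioritization described above combines two signals: how distant an organism is from anything already sequenced, and how often it appears across samples. The Python sketch below shows one simple way to combine them. The scoring formula (novelty multiplied by prevalence), the field names and the example values are assumptions for illustration and do not reproduce the criteria actually used for the Most Wanted Genomes list.

```python
from typing import NamedTuple


class Otu(NamedTuple):
    name: str
    identity_to_nearest_genome: float  # 0..1, identity to closest sequenced strain
    prevalence: float                  # fraction of samples containing the OTU


def rank_candidates(otus):
    """Sort OTUs so that novel and frequently observed ones come first."""
    def score(otu):
        novelty = 1.0 - otu.identity_to_nearest_genome
        return novelty * otu.prevalence
    return sorted(otus, key=score, reverse=True)


# Hypothetical OTUs, for illustration only.
candidates = [
    Otu("OTU_A", 0.99, 0.90),  # common but already well represented
    Otu("OTU_B", 0.85, 0.50),  # novel and fairly common
    Otu("OTU_C", 0.80, 0.05),  # novel but rare
]
ranked_names = [otu.name for otu in rank_candidates(candidates)]
print(ranked_names)
```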
HMP1 Manual of Procedures (MOP)
A reference document for National Institutes of Health (NIH) policies and procedures as they apply to the Human Microbiome Project (HMP) Core Microbiome Sampling study.
- Manual of Procedures, version 12.0
- Supplement and Updates to the HMP MOP v12.0 - updated 7/2012
- Core Microbiome Sampling Protocol A, version 9.0
Study participant consent forms
Samples were collected from 300 healthy adult men and women between the ages of 18 and 40, recruited at Baylor College of Medicine in Houston, TX and Washington University in St. Louis, MO. Here we provide consent forms and other information provided to participants by the two universities.
16S Sequencing and Analysis
16S 454 Sequencing Protocol
This 454 Sequencing Protocol was developed by the HMP Consortium for the clinical sample pilot study. The SOP document includes the barcoded primer sequences used for amplification of 16S variable regions V1-3 and V3-5.
16S Data Flow for HMP Sequencing Centers
Guidelines developed for the HMP Sequencing Centers regarding submission of 16S rRNA gene data and metadata to the HMP DACC.
SFF and Library Metadata File (lmd) Generation
The HMP DACC downloaded all clinical sample sequence runs from NCBI's Sequence Read Archive (SRA), and converted SRA native format to SFF format. SRA XML files were parsed for metadata and tab delimited library metadata (LMD) files created. This SOP describes file creation for SRA study accession SRP002395 (Clinical Production Phase I 16S 454 Sequencing), available at HMR16S. Subsequent HMP 16S 454 studies submitted to SRA were processed in the same fashion.
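The metadata step described above, parsing XML and writing a tab-delimited file, can be sketched in Python as follows. The XML snippet is a heavily simplified stand-in for SRA experiment XML, the accessions are placeholders, and the three output columns are assumptions; the real LMD layout is defined in the SOP.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Simplified, hypothetical example input (placeholder accessions).
EXAMPLE_XML = """<EXPERIMENT_SET>
  <EXPERIMENT accession="SRX000001">
    <DESIGN><LIBRARY_DESCRIPTOR>
      <LIBRARY_NAME>lib_A</LIBRARY_NAME>
      <LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY>
    </LIBRARY_DESCRIPTOR></DESIGN>
  </EXPERIMENT>
  <EXPERIMENT accession="SRX000002">
    <DESIGN><LIBRARY_DESCRIPTOR>
      <LIBRARY_NAME>lib_B</LIBRARY_NAME>
      <LIBRARY_STRATEGY>AMPLICON</LIBRARY_STRATEGY>
    </LIBRARY_DESCRIPTOR></DESIGN>
  </EXPERIMENT>
</EXPERIMENT_SET>"""


def xml_to_lmd(xml_text):
    """Extract one tab-delimited row of library metadata per EXPERIMENT element."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["experiment", "library_name", "library_strategy"])
    for exp in root.iter("EXPERIMENT"):
        writer.writerow([
            exp.get("accession", ""),
            exp.findtext(".//LIBRARY_NAME", default=""),
            exp.findtext(".//LIBRARY_STRATEGY", default=""),
        ])
    return out.getvalue()


lmd_text = xml_to_lmd(EXAMPLE_XML)
print(lmd_text)
```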
16S rRNA Processing
The HMP DACC performed baseline processing of all 16S variable region sequences. Processing involved trimming and deconvoluting raw SFF files (HMR16S), reference alignment using NAST-iEr, chimera identification using ChimeraSlayer, aberrant sequence identification using WigeoN, and taxonomic binning using RDP classifier. Trimmed reads per sample and processing results are available at HM16STR.
- 16S rRNA Processing SOP
- MicrobiomeUtilities - a set of utilities including NAST-iEr, ChimeraSlayer and WigeoN, 2010-04-29 release
- RDP - latest release
- RDP Classifier v2.2
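The deconvolution step mentioned above assigns each read in a multiplexed run to a sample by its barcode. A minimal Python sketch of that idea follows, assuming exact barcode matches at the start of each read; the barcodes, sample names and reads are made up, and the actual SOP works on SFF files with its own matching and trimming rules.

```python
def demultiplex(reads, barcode_to_sample):
    """Assign reads to samples by leading barcode and strip the barcode."""
    by_sample = {sample: [] for sample in barcode_to_sample.values()}
    unassigned = []
    for read in reads:
        for barcode, sample in barcode_to_sample.items():
            if read.startswith(barcode):
                by_sample[sample].append(read[len(barcode):])
                break
        else:
            # No barcode matched this read.
            unassigned.append(read)
    return by_sample, unassigned


# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample1", "TTAA": "sample2"}
reads = ["ACGTGGGCCC", "TTAACCCGGG", "GGGGAAAA", "ACGTTTTT"]
by_sample, unassigned = demultiplex(reads, barcodes)
print(by_sample, unassigned)
```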
mothur curation pipeline
16S rRNA gene sequences were processed using the mothur software package v1.18, using both a high and low stringency approach. The high stringency approach provided an output with more aggressive sequence error reduction tailored towards Operational Taxonomic Unit (OTU) construction, while the low stringency approach favored longer read lengths tailored towards taxonomic classification. The mothur output from both high and low stringency approaches is available at HMMCP.
- 16S rRNA Gene Sequence Curation Pipeline SOP
- mothur - latest release
- mothur v1.18 - April 2011 release
- mothur wiki
QIIME community profiling
Raw 16S sequences and metadata (HMR16S) were demultiplexed and then taken through OTU picking, taxonomic assignment, construction of phylogenetic trees from representative sequences, and downstream statistical analysis and visualization, using tools contained within the QIIME package and additional custom scripts available here. QIIME output is available at HMQCP.
- QIIME Community Profiling SOP
- Greengenes 16S rRNA database
- RDP Classifier v2.2
- SitePainter - a tool for exploring biogeographical patterns
HMP Single Cell MDA 16S rRNA Sanger sequencing SOP
In conjunction with the HMP Reference Genome Working Group's "Most Wanted Genome" efforts to acquire and sequence representative genomes covering the breadth of phylogenetic diversity of human-associated microorganisms, a small number of HMP fecal, oral and skin samples underwent single bacterial cell multiple displacement amplification (MDA). 16S rRNA gene sequences were obtained to taxonomically classify the MDA reactions. Representative 16S sequences are available at HMMDA16S.
Whole Metagenome Sequencing and Analysis
Human Sequence Removal
Metagenomic shotgun sequencing was performed using the Illumina GAIIx platform. Prior to public release, all mWGS sequences submitted to SRA were processed to filter out human sequence using NCBI's Best Match Tagger (BMTagger).
mWGS Read Processing
Reads were processed to identify and mask human reads, remove duplicated reads, and trim low quality bases. Reads are available at HMIWGS.
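Two of the read processing steps named above, duplicate removal and trimming of low quality bases, can be sketched in Python as below. Human read masking is not shown. The quality cutoff of 20, the exact-sequence definition of a duplicate and the in-memory read representation are assumptions for this example, not the parameters of the HMP protocol.

```python
def remove_exact_duplicates(reads):
    """Keep the first occurrence of each distinct read sequence."""
    seen = set()
    unique = []
    for seq, quals in reads:
        if seq not in seen:
            seen.add(seq)
            unique.append((seq, quals))
    return unique


def trim_low_quality_tail(seq, quals, min_q=20):
    """Remove bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def process_reads(reads, min_q=20):
    """Deduplicate, then quality-trim, dropping reads trimmed to nothing."""
    trimmed = [trim_low_quality_tail(s, q, min_q) for s, q in remove_exact_duplicates(reads)]
    return [(s, q) for s, q in trimmed if s]


# Hypothetical reads as (sequence, per-base quality) pairs.
raw = [
    ("ACGTAC", [30, 30, 30, 30, 10, 5]),
    ("ACGTAC", [30, 30, 30, 30, 10, 5]),  # exact duplicate
    ("GGCCTT", [30, 30, 30, 30, 30, 30]),
]
processed = process_reads(raw)
print(processed)
```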
The 2012 landmark HMP Nature publications describe ~700 samples sequenced and assembled using SOAPdenovo, and subsequent QC and downstream analysis. These assemblies are available at HMASM. Unassembled reads were grouped by body site and assembled using SOAPdenovo to generate body-site-specific assemblies, available at HMBSA. Twelve of the initial stool samples were selected for supplementary sequencing using the 454 FLX Titanium platform. Processed reads from the 454/Illumina hybrid data, filtered for human contaminants, were assembled using Newbler and are available at HMHASM.
In 2017, all Illumina mWGS reads were reprocessed and assembled using IDBA-UD. Raw and assembled data are available at hmwgsqc2 and hmasm2, respectively. In cases where multiple samples were collected from the same host and body site across two or three timepoints (visits), those quality trimmed read sets were concatenated and co-assembled using IDBA-UD, using the same protocol as used for individual sample assemblies. These co-assemblies are available at hmcasm2.
- Whole Metagenome Assembly SOP using SOAPdenovo
- Body Site Specific Assembly SOP using SOAPdenovo
- Hybrid Assembly SOP using Newbler
- Whole Metagenome Assembly SOP using IDBA-UD (coming soon)
- SOAPdenovo v1.04
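The co-assembly grouping described above (read sets from the same host and body site across visits are concatenated before assembly) can be sketched in Python as follows. The record fields and file names are hypothetical, and the sketch only decides which read sets belong together; it does not run IDBA-UD.

```python
from collections import defaultdict


def plan_coassemblies(samples):
    """Group read sets by (host, body site); keep groups with more than one visit."""
    groups = defaultdict(list)
    for sample in samples:
        groups[(sample["host"], sample["body_site"])].append(sample)
    plans = {}
    for key, members in groups.items():
        if len(members) > 1:
            ordered = sorted(members, key=lambda s: s["visit"])
            plans[key] = [s["reads"] for s in ordered]
    return plans


# Hypothetical sample records, for illustration only.
samples = [
    {"host": "H1", "body_site": "stool", "visit": 2, "reads": "H1_stool_v2.fastq"},
    {"host": "H1", "body_site": "stool", "visit": 1, "reads": "H1_stool_v1.fastq"},
    {"host": "H1", "body_site": "nares", "visit": 1, "reads": "H1_nares_v1.fastq"},
    {"host": "H2", "body_site": "stool", "visit": 1, "reads": "H2_stool_v1.fastq"},
]
plans = plan_coassemblies(samples)
print(plans)
```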
mWGS Annotation Protocols
Assembled sequences underwent structural and functional annotation using a pipeline developed at the JCVI, later released as METAREP. In 2017, improved assemblies were re-annotated using Attributor. METAREP and Attributor gene indices are available at HMGI and hmgi2, respectively. Annotated gene indices were clustered to create a non-redundant catalog of bacterial proteins, see HMGC and hmgc2. 2017 co-assemblies were likewise annotated and clustered, see hmcgi2 and hmcgc2.
- Prokaryotic Metagenomics Annotation Pipeline SOP (JCVI)
- Attributor SOP (coming soon)
- Non-redundant Clustered Gene Index SOP
Gene Ontology Analysis
Annotated gene indices were mapped to a slimmed down GO ontology built by the HMP DACC, to create a summary matrix of the biological processes represented across the sample set, available at HMGS.
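The summary matrix described above can be pictured as a count of slim terms per sample. The Python sketch below assumes a precomputed lookup from each annotated term to a single slim term; the term identifiers and slim categories are placeholders, and a real mapping would follow the ontology's parent relationships rather than a flat dictionary.

```python
from collections import Counter


def slim_summary(sample_annotations, term_to_slim):
    """Count, per sample, how many gene annotations fall under each slim term."""
    matrix = {}
    for sample, terms in sample_annotations.items():
        counts = Counter()
        for term in terms:
            slim = term_to_slim.get(term)
            if slim is not None:  # terms outside the slim are ignored
                counts[slim] += 1
        matrix[sample] = dict(counts)
    return matrix


# Placeholder identifiers, not real GO accessions.
term_to_slim = {"T1": "metabolic process", "T2": "metabolic process", "T3": "transport"}
annotations = {
    "sampleA": ["T1", "T2", "T3", "T9"],
    "sampleB": ["T3", "T3"],
}
matrix = slim_summary(annotations, term_to_slim)
print(matrix)
```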
Two strategies were initially adopted to estimate organismal abundance from Illumina mWGS reads. In the first approach, reads were mapped onto a database of reference genomes developed by the HMP consortium. These results are available at HMSCP. In the second approach, the MetaPhlAn classifier compared each read to a pre-computed catalog of unique clade-specific markers. MetaPhlAn results are available at HMSMCP.
- Read mapping to REFG database SOP
- MetaPhlAn SOP (coming soon)
- Reference Genome Database REFG
- MetaPhlAn v1.1.0
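Both strategies above turn read counts into abundance estimates. A common simplification, shown in the Python sketch below, divides the number of reads assigned to each reference by the reference length and then normalizes so the values sum to one. This is an illustrative assumption about the general approach, not the calculation documented in the read mapping or MetaPhlAn SOPs, and the genome names, counts and lengths are made up.

```python
def relative_abundance(read_counts, ref_lengths):
    """Length-normalize read counts and scale them to sum to 1."""
    coverage = {name: read_counts[name] / ref_lengths[name] for name in read_counts}
    total = sum(coverage.values())
    if total == 0:
        return {name: 0.0 for name in coverage}
    return {name: value / total for name, value in coverage.items()}


# Hypothetical counts and reference lengths (in bases).
counts = {"genome_A": 300, "genome_B": 100}
lengths = {"genome_A": 3000, "genome_B": 500}
abundance = relative_abundance(counts, lengths)
print(abundance)
```

Here genome_B receives fewer reads than genome_A but ends up with the higher estimate, because its reference is much shorter.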
Metabolic Reconstruction and Pathway Analysis
In order to reconstruct the metabolic activities of microbiome communities, the HMP Consortium developed the HMP Unified Metabolic Analysis Network (HUMAnN) pipeline, which infers community function directly from short mWGS reads, using the KEGG ortholog (KO) groups. These results are available at HMMRC.
During the active years of the HMP, the DACC created a set of walkthroughs: step-by-step tutorials taking users through typical HMP analysis paths, complete with sample datasets, detailed steps, screenshots and example output. These were geared toward educating researchers, particularly those without extensive bioinformatics infrastructure or experience, on using selected tools and resources to reproduce HMP analyses, with HMP-generated or personal data as input. HMP walkthroughs used CloVR, a desktop application integrating state-of-the-art genomic tools in a robust, user-friendly, fully automated software package with optional support for cloud computing platforms. CloVR is distributed as a portable virtual machine launched on a desktop or laptop under VMware or VirtualBox.
Legacy Walkthrough Tutorials
CloVR funding has since ended, and the resource and walkthroughs are no longer maintained or supported. We provide links to the walkthroughs below as a legacy resource only.
- HMP CloVR-16S rRNA Pipeline - The CloVR-16S pipeline processes short and long sequence reads from Sanger and Roche/454 sequencing platforms, including sequence reads generated with the multiplex amplicon 454 pyrosequencing protocol with specifically tagged or barcoded 16S rRNA PCR primers. This pipeline employs several well-known phylogenetic tools and protocols: QIIME, a Python-based workflow package, allowing for sequence processing and phylogenetic analysis using different methods including the phylogenetic distance metric UniFrac, UCLUST, PyNAST and the RDP Bayesian classifier; UCHIME, a tool for rapid identification of chimeric 16S sequence fragments; mothur, a C++-based software package for 16S analysis; Metastats and custom R scripts used to generate additional statistical and graphical evaluations. This walkthrough uses HMP 16S rRNA sequences representing communities extracted from 12 hard-palate and 12 attached-keratinized gingiva oral sites.
- HMP CloVR-Metagenomics Shotgun Analysis Pipeline - The CloVR-Metagenomics protocol supports the analysis of shotgun sequencing data from total metagenomic DNA sequencing projects. This pipeline uses a number of well-known tools for analysis of metagenomic data: UCLUST clusters redundant sequences showing 99% nucleotide identity and removes artificial 454 replicate reads; representative DNA sequences are searched against the NCBI COG database using BLASTX; representative DNA sequences are searched against the NCBI RefSeq database of finished prokaryotic genomes using BLASTN; Metastats and custom R scripts are applied for additional statistical and graphical evaluations of the pipeline results. CloVR-Metagenomics generates several output reports including taxonomic and functional abundance tables, statistical comparisons of feature abundances between user-defined populations, and heatmaps with unsupervised clusterings of all samples. This walkthrough uses HMP WGS reads representing microbial communities extracted from the mid-vagina and vaginal introitus sites.
- HMP CloVR-Human Contaminant Screening Pipeline - This pipeline uses the NCBI BMTagger (Best Match Tagger) tool to identify and remove human reads in metagenomic sequences. For this walkthrough, we use a mock dataset which consists of a 50:50 mix of human contaminant-screened reads from an HMP project and filtered human reads from a 1000 genomes project.
- HMP CloVR-mWGS Assembly Pipeline - This pipeline is used to generate a "Pretty Good Assembly", a reasonable attempt at reconstructing pieces of the organisms present in the community that are long enough to allow gene finding and other downstream analyses. This version of the pipeline uses SOAPdenovo v.1.04 and is based on the HMP Whole Metagenome Assembly SOP using SOAPdenovo. For this walkthrough, we use a sample from the HMP Anterior Nares body site.
- HMP CloVR-Read to Reference Genome Alignment Pipeline - This walkthrough provides a simple example of how to set up and run the Bowtie Aligner, as well as analyze the resulting output. For this walkthrough, we align metagenomic WGS reads extracted from the Anterior Nares body site (sample SRS019215) to the reference genome Staphylococcus aureus.
- HMP CloVR-HUMAnN Pipeline - The HMP Unified Metabolic Analysis Network (HUMAnN) pipeline was developed by the HMP Consortium to efficiently and accurately determine the presence/absence and abundance of microbial pathways in a community from metagenomic data. Sequencing a metagenome typically produces millions of short DNA/RNA reads. HUMAnN takes these reads as inputs and produces gene and pathway summaries as outputs: the abundance of each orthologous gene family in the community; the presence/absence of each pathway in the community; the abundance of each pathway in the community, i.e. how many copies of that pathway are present. For this walkthrough, we use genes from the Anterior Nares body site (sample SRS019215).
- HMP CloVR-Digital Normalization Pipeline - This pipeline uses the DigiNorm algorithm to normalize metagenomic reads, substantially reducing the size without any significant impact on the assemblies that will be generated. This walkthrough uses a sample dataset from the HMP Illumina WGS Reads, Sample SRS018671.
- HMP CloVR-Gene Clustering Pipeline - This pipeline takes gene predictions from metagenomic shotgun sequence data (assemblies or reads), and generates a non-redundant gene set, using USEARCH (Edgar, 2010).
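The digital normalization idea used by the CloVR-Digital Normalization pipeline listed above can be sketched in a few lines of Python: a read is kept only if the median count of its k-mers, among reads kept so far, is still below a cutoff. This is a toy illustration with exact in-memory counts; the values of k and the cutoff are arbitrary assumptions, and production implementations typically use canonical k-mers and memory-efficient counting structures.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only while its median k-mer count is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Hypothetical reads: one highly redundant sequence and one rare sequence.
reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
kept = digital_normalization(reads)
print(kept)
```

Redundant copies of the abundant read are discarded once its k-mers are well covered, while the rare read is retained, which is how the approach reduces dataset size before assembly.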