Tools and Technology


Tools

Software and online resources used by, or developed as part of, the HMP are provided here.

Please be aware that HMP1 funding ended in 2012, and therefore some of these resources may have changed, moved or been discontinued. This list is no longer regularly maintained.

Microbial Reference Genomes

Downloadable Tools
Core Gene Evaluation Script
Screening for core gene sets as an indicator of the completeness of draft genomes. This download includes a Perl script and the required archaeal and bacterial core gene FASTA and cluster files.
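As a rough illustration of the idea behind core gene screening (this is not the Perl script itself), completeness can be estimated as the fraction of an expected core gene set that is detected in a draft genome. The function name and the gene names below are hypothetical placeholders chosen for the sketch.

```python
def core_gene_completeness(core_genes, detected_genes):
    """Return (fraction of core genes detected, sorted list of missing core genes)."""
    core = set(core_genes)
    if not core:
        raise ValueError("core gene set must not be empty")
    found = core & set(detected_genes)
    missing = sorted(core - found)
    return len(found) / len(core), missing


# Hypothetical core set and hypothetical hits from a draft assembly.
core_set = ["rpoB", "gyrB", "recA", "dnaK"]
hits = ["recA", "rpoB", "dnaK", "someOtherGene"]
fraction, missing = core_gene_completeness(core_set, hits)
print(f"{fraction:.0%} of core genes found; missing: {missing}")
```

A low fraction suggests that the draft assembly may be incomplete, although how hits are detected and what threshold is acceptable depend on the actual protocol.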
Online Resources
IMG System
A community resource for comparative analysis and annotation of publicly available genomes in a uniquely integrated context
RAST Annotation Server
A fully-automated service for annotating bacterial and archaeal genomes, leveraging data and procedures established within the SEED framework to provide high quality gene calling and functional annotation

Sampling, Sequencing, & Analyses of 16S RNA

Downloadable Tools
DNAclust
DNAclust is a fast clustering algorithm specifically designed for high-stringency clustering of DNA sequences, e.g. for 16S rRNA analyses or removal of duplicates/near duplicates in high-throughput shotgun datasets.
GINKGO
A GUI software package designed for non-statisticians to perform multivariate analysis
LEfSe
LDA Effect Size is an algorithm for high-dimensional biomarker discovery and explanation that identifies metagenomic features (genes, pathways, or taxa) characterizing the differences between two or more biological conditions. It can be applied to taxonomic or functional abundance tables derived from metagenomic (WGS) data or 16S OTU/phylotype data.
metagenomeSeq
metagenomeSeq is an R package designed to determine features that are differentially abundant between two or more groups of multiple samples. metagenomeSeq implements both our novel normalization and statistical model accounting for under-sampling of microbial communities and may be applicable to other datatypes.
The HMP used an earlier iteration of this package, Metastats, which is no longer available.
MicrobiomeUtilities
A set of software utilities for processing and analyzing 16S rRNA genes, including generating NAST alignments, chimera checking, and assembling paired 16S rRNA reads according to reference sequence homology
Mothur
A platform-independent software package for describing and comparing microbial communities. Mothur incorporates the functionality of a number of computational tools, calculators & visualization tools into a single program
Qiime
'Quantitative Insight Into Microbial Ecology'. Qiime allows a range of community analyses suitable for microbiome data using traditional and high-throughput sequencing methods
R-package: Hypothesis Testing and Power Calculations for Comparing Metagenomic Samples from HMP
This R-package provides several functions to perform formal hypothesis testing on the species abundance distribution of human microbiome data, and to calculate power and sample size requirements for human microbiome experiments.
R-package: Statistical Object Oriented Data Analysis of RDP-based Taxonomic trees from Human Microbiome Data: Modeling, Visualization, and Two-Group Comparison
This R-package introduces Object Oriented Data Analysis (OODA) methods to analyze Human Microbiome taxonomic trees directly, providing tools to model, compare, and visualize populations of taxonomic tree objects.
Simrank
A rapid and sensitive general-purpose k-mer search tool
speciateIT
A package for speciation of 16S sequences
Unifrac
A suite of tools for the comparison of microbial communities using phylogenetic information. It takes as input a single phylogenetic tree that contains sequences derived from at least two different environmental samples and a file describing which sequences came from which sample. Unifrac is no longer available as a standalone tool, but it has been incorporated into Qiime and mothur.
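To make the two Unifrac inputs concrete, here is a minimal sketch of the unweighted distance: the fraction of branch length leading to sequences from only one of two samples. The tree encoding (a dict of child to parent and branch length), the function name, and the toy data are assumptions made for illustration; this is not the code used in Qiime or mothur.

```python
def unweighted_unifrac(parents, leaf_sample, sample_a, sample_b):
    """Unweighted UniFrac-style distance between two samples.

    parents: dict mapping each non-root node to (parent_node, branch_length).
    leaf_sample: dict mapping each leaf (sequence) to the sample it came from.
    """
    # For every branch (identified by its child node), record which of the
    # two samples have at least one sequence below that branch.
    samples_below = {}
    for leaf, sample in leaf_sample.items():
        if sample not in (sample_a, sample_b):
            continue
        node = leaf
        while node in parents:  # walk up until the root is reached
            samples_below.setdefault(node, set()).add(sample)
            node = parents[node][0]

    unique = total = 0.0
    for node, samples in samples_below.items():
        length = parents[node][1]
        total += length
        if len(samples) == 1:  # branch leads to only one of the two samples
            unique += length
    if total == 0:
        raise ValueError("no branches lead to sequences from either sample")
    return unique / total


# Hypothetical toy tree: root -> n1 -> (s1, s2), root -> n2 -> s3.
parents = {
    "n1": ("root", 1.0),
    "n2": ("root", 1.0),
    "s1": ("n1", 0.5),
    "s2": ("n1", 0.5),
    "s3": ("n2", 1.0),
}
leaf_sample = {"s1": "gut", "s2": "skin", "s3": "skin"}
distance = unweighted_unifrac(parents, leaf_sample, "gut", "skin")
print(distance)
```

In this toy tree only the branch above n1 is shared by both samples, so 3.0 of the 4.0 units of branch length are unique to one sample.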
Online Resources
Greengenes
A 16S rRNA gene database and workbench compatible with ARB
RDP
Provides ribosome related data and services to the scientific community, including online data analysis and aligned and annotated Bacterial and Archaeal small-subunit 16S rRNA sequences
SitePainter
SitePainter allows users to visualize the different HMP body sites based on gradients of colors to represent available datasets

Sampling, Sequencing & Analysis of Whole Metagenomic Sequence

Downloadable Tools
BMTagger
NCBI's Best Match Tagger for removing human reads from metagenomics datasets. All HMP metagenomic sequence submitted to NCBI's Sequence Read Archive is being human filtered using BMTagger.
DeconSeq
Automatically detects and efficiently removes any type of sequence contamination from metagenomic datasets, including human or other host sequences. The tool uses a modified version of the BWA-SW aligner and can be applied to longer-read datasets (150+bp read length). DeconSeq is available as both standalone and web-based versions.
DNAclust
DNAclust is a fast clustering algorithm specifically designed for high-stringency clustering of DNA sequences, e.g. for 16S rRNA analyses or removal of duplicates/near duplicates in high-throughput shotgun datasets.
FragGeneScan
A short read gene finder
GINKGO
A GUI software package designed for non-statisticians to perform multivariate analysis
HUMAnN
The HMP Unified Metabolic Analysis Network (HUMAnN) is a pipeline for efficiently and accurately determining the presence/absence and abundance of microbial pathways in a community from metagenomic data (WGS). The pipeline converts sequence reads into coverage and abundance tables summarizing the gene families and pathways in one or more microbial communities.
LEfSe
LDA Effect Size is an algorithm for high-dimensional biomarker discovery and explanation that identifies metagenomic features (genes, pathways, or taxa) characterizing the differences between two or more biological conditions. It can be applied to taxonomic or functional abundance tables derived from metagenomic (WGS) data or 16S OTU/phylotype data.
MetAMOS
MetAMOS is a pipeline for metagenomic assembly. It includes a collection of utilities for performing the assembly and for analyzing assembly output.
MetaPhlAn
A computational tool for profiling the composition of microbial communities from metagenomic data (WGS). MetaPhlAn relies on unique clade-specific marker genes identified from reference genomes, allowing very fast computational times, unambiguous taxonomic assignments, and species-level resolution.
Metapath
Metapath is a statistical package for comparing metagenomic data-sets at the pathway level (using KEGG pathway information). Metapath relies on a graph-theoretic definition of statistical significance in order to identify pathway motifs that differ between samples from two treatment populations.
METAREP
An open source tool to help scientists to view, query, browse, and compare metagenomics annotation data derived from ORFs called on metagenomics reads or assemblies (also available as an Online Resource)
metagenomeSeq
metagenomeSeq is an R package designed to determine features that are differentially abundant between two or more groups of multiple samples. metagenomeSeq implements both our novel normalization and statistical model accounting for under-sampling of microbial communities and may be applicable to other datatypes.
The HMP used an earlier iteration of this package, Metastats, which is no longer available.
PRINSEQ
A sequence processing tool that can be used to filter, reformat and trim genomic and metagenomic sequence data. It generates summary statistics of the input in graphical and tabular formats that can be used for quality control steps. PRINSEQ is available as both standalone and web-based versions.
Simrank
A rapid and sensitive general-purpose k-mer search tool
TagCleaner
Automatically detects and efficiently removes tag sequences (e.g. WTA or MID tags) from metagenomic datasets. TagCleaner is available as both standalone and web-based versions.
Online Resources
Biocyc
A collection of Pathway/Genome Databases (PGDBs). Each PGDB describes the genome and metabolic pathways of a single organism. The MetaCyc database was used for HMP metabolic reconstruction.
IMG/M
Provides tools for analyzing the functional capability of microbial communities based on their metagenome sequence, in the context of reference isolate genomes included from the Integrated Microbial Genomes (IMG) system
MG-RAST
A fully-automated service for annotating metagenome samples, providing annotation of sequence fragments, phylogenetic classification, metabolic reconstructions and comparison tools

Protocols

Standard operating protocols used by, or developed as part of, the HMP are provided here.

Please be aware that HMP1 funding ended in 2012, and therefore certain protocols may refer to resources that have changed, moved or been discontinued. This list is no longer regularly maintained.

Microbial Reference Genomes

Reference Genomes Database
HMP single cell MDA 16S rRNA Sanger sequencing SOP
Strain selection guidelines
Guidelines for Reference Genome strain selection
BEI contamination protocol

HMP Sequencing Center-specific Annotation Protocols

The initial set of 178 Bacterial Reference Genomes described in the 2010 publication "A Catalog of Reference Genomes from the Human Microbiome" was annotated using individual sequencing center methodologies:

Consensus Annotation Protocols

Subsequent Reference Genomes have been annotated using a consensus protocol for gene calling & functional annotation:
Provisional Reference Genome Assembly Metrics A set of quality control metrics run on every HMP Reference Genome to ensure accuracy, completeness and continuity of draft and improved assemblies
Bacterial Core Gene Evaluation Protocol describing use of the Core Gene Evaluation Script to assess completeness of bacterial draft assemblies
Archaeal Core Gene Evaluation Protocol describing use of the Core Gene Evaluation Script to assess completeness of archaeal draft assemblies

Sampling, Sequencing, & Analyses of 16S RNA

Manual of Procedures (MOP) A reference document for current National Institutes of Health (NIH) policies and procedures as they apply to the Human Microbiome Project (HMP) Core Microbiome Sampling study

MOP Updates Please download the MOP Supplement PDF for updates to product information and links.

Study participant consent forms can be found on the Microbiome Analysis page, under the Sample Collection tab.
Core Microbiome Sampling Protocol
16S Data Flow for HMP Sequencing Centers Guidelines for the HMP Sequencing Centers for submitting 16S rRNA gene data and metadata to the iHMP DCC
HMP 16S 454 protocol
Human Sequence Removal
SFF and Library Metadata File Generation
16S rRNA mothur Curation Pipeline
QIIME Community Profiling SOP

Sampling, Sequencing & Analysis of Whole Metagenomic Sequence

Manual of Procedures (MOP) A reference document for current National Institutes of Health (NIH) policies and procedures as they apply to the Human Microbiome Project (HMP) Core Microbiome Sampling study

MOP Updates Please download the MOP Supplement PDF for updates to product information and links.

Study participant consent forms can be found on the Microbiome Analysis page, under the Sample Collection tab.
Core Microbiome Sampling Protocol
Human Sequence Removal
HMP WGS Read Processing
HMP Whole-Metagenome Assembly
Body Site Assembly
Metagenomics Annotation SOP
GO Slim Analysis
Functional Database SOP
HUMAnN SOP
HMP Hybrid Assembly

Other Analysis

Walkthroughs

Walkthroughs are step-by-step tutorials taking users through typical HMP analysis paths, complete with sample datasets, detailed steps, screenshots and example output. These are geared toward educating researchers, particularly those without extensive bioinformatics infrastructure or experience, on using selected tools and resources to reproduce HMP analyses, with HMP-generated or personal data as input.

Initial HMP walkthroughs use CloVR, a desktop application integrating state-of-the-art genomic tools in a robust, user-friendly, fully automated software package with optional support for cloud computing platforms. CloVR is distributed as a portable virtual machine launched on a desktop or laptop under VMware or VirtualBox.

If you have questions about current walkthroughs, or would like to suggest additional walkthroughs, provide feedback or participate in beta testing of future walkthroughs, please contact us via the feedback form.

I. iHMP DCC 16S CloVR walkthrough

CloVR-16S supports 16S ribosomal RNA sequence analysis to study microbial community compositions. It processes short and long sequence reads from Sanger as well as Roche/454 sequencing, including sequence reads generated with the multiplex amplicon 454 pyrosequencing protocol with specifically tagged or barcoded 16S rRNA PCR primers. The CloVR-16S pipeline employs several well-known phylogenetic tools and protocols:

  • QIIME - a Python-based workflow package, allowing for sequence processing and phylogenetic analysis using different methods including the phylogenetic distance metric UniFrac, UCLUST, PyNAST and the RDP Bayesian classifier;
  • UCHIME - a tool for rapid identification of chimeric 16S sequence fragments;
  • Mothur - a C++-based software package for 16S analysis;
  • Metastats and custom R scripts used to generate additional statistical and graphical evaluations.

This walkthrough uses HMP 16S rRNA sequences representing communities extracted from 12 hard-palate and 12 attached-keratinized gingiva oral sites.

II. iHMP DCC Metagenomics CloVR walkthrough

The CloVR-Metagenomics protocol supports the analysis of shotgun sequencing data from total metagenomic DNA sequencing projects. This pipeline utilizes a number of well-known tools for analysis of metagenomic data:

  • UCLUST first clusters redundant sequences that show 99% nucleotide identity and removes artificial 454 replicate reads.
  • Representative DNA sequences are searched against the NCBI COG database using BLASTX.
  • Representative DNA sequences are searched against the NCBI RefSeq database of finished prokaryotic genomes using BLASTN.
  • Metastats and CloVR-implemented R scripts are applied for additional statistical and graphical evaluations of the pipeline results.
  • CloVR-Metagenomics generates several output reports including taxonomic and functional abundance tables, statistical comparisons of feature abundances between user-defined populations, and heatmaps with unsupervised clusterings of all samples.

This walkthrough uses HMP WGS reads representing microbial communities extracted from the mid-vagina and vaginal introitus sites.

III. Human Contaminant Screening

This pipeline uses the NCBI BMTagger (Best Match Tagger) tool to identify and remove human reads in metagenomic sequences. For this walkthrough, we use a mock dataset which consists of a 50:50 mix of human contaminant-screened reads from an HMP project and filtered human reads from the 1000 Genomes Project.
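Because the origin of every read in a mock dataset is known, the output of a screening step can be scored against it. The sketch below is not part of BMTagger or the walkthrough; the function name, read identifiers, and flagged list are hypothetical, and it only shows one possible way to compute sensitivity and specificity.

```python
def screening_metrics(true_human, flagged_human, all_reads):
    """Score a human-read screen against known labels.

    Returns (sensitivity, specificity):
      sensitivity = fraction of true human reads that were flagged
      specificity = fraction of non-human reads that were left unflagged
    """
    all_reads = set(all_reads)
    true_human = set(true_human) & all_reads
    flagged = set(flagged_human) & all_reads
    non_human = all_reads - true_human
    if not true_human or not non_human:
        raise ValueError("need at least one human and one non-human read")
    sensitivity = len(flagged & true_human) / len(true_human)
    specificity = len(non_human - flagged) / len(non_human)
    return sensitivity, specificity


# Hypothetical 50:50 mock dataset: 4 human reads and 4 microbial reads.
reads = [f"read{i}" for i in range(8)]
human = reads[:4]
flagged = ["read0", "read1", "read2", "read5"]  # hypothetical screen output
sens, spec = screening_metrics(human, flagged, reads)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```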

IV. Metagenomic Assembly

This pipeline is used to generate a "Pretty Good Assembly", a reasonable attempt at reconstructing pieces of the organisms present in the community that are long enough to allow gene finding and other downstream analyses. This version of the pipeline uses SOAPdenovo v.1.04. The HMP Whole-Metagenome Assembly protocol provides a detailed description of the pipeline. For this walkthrough, we use a sample from the HMP Anterior Nares body site.

V. Alignment of Metagenomic Reads to Reference Genomes Using Bowtie

This walkthrough provides a simple example of how to set up and run the Bowtie aligner using the web-browser-accessible CloVR dashboard, and how to analyze the resulting outputs. We align metagenomic WGS reads extracted from the Anterior Nares body site (sample SRS019215) to a Staphylococcus aureus reference genome.

VI. HUMAnN (HMP Unified Metabolic Analysis Network)

The HUMAnN pipeline is used for efficiently and accurately determining the presence/absence and abundance of microbial pathways in a community from metagenomic data. Sequencing a metagenome typically produces millions of short DNA/RNA reads. HUMAnN takes these reads as inputs and produces gene and pathway summaries as outputs:

  • The abundance of each orthologous gene family in the community.
  • The presence/absence of each pathway in the community.
  • The abundance of each pathway in the community, i.e. how many copies of that pathway are present.

For this walkthrough, we use genes from the Anterior Nares body site (sample SRS019215).
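The three outputs listed above can be illustrated with a deliberately simplified sketch. This is not HUMAnN's actual algorithm: the coverage threshold, the use of a plain mean for pathway abundance, the function name, and all gene family and pathway names are assumptions made only to show how gene family abundances might roll up into pathway-level summaries.

```python
def summarize_pathways(family_abundance, pathways, min_coverage=0.5):
    """Return {pathway: (present, abundance)} from gene family abundances.

    present: True if at least `min_coverage` of the pathway's gene families
             have non-zero abundance (an illustrative threshold).
    abundance: mean abundance across all of the pathway's gene families.
    """
    summary = {}
    for name, families in pathways.items():
        if not families:
            raise ValueError(f"pathway {name!r} has no gene families")
        values = [family_abundance.get(f, 0.0) for f in families]
        coverage = sum(v > 0 for v in values) / len(values)
        summary[name] = (coverage >= min_coverage, sum(values) / len(values))
    return summary


# Hypothetical gene family abundances and pathway definitions.
abundance = {"fam1": 4.0, "fam2": 2.0, "fam3": 0.0, "fam4": 6.0}
pathway_defs = {
    "pathway_A": ["fam1", "fam2"],
    "pathway_B": ["fam3", "fam5", "fam4"],
}
result = summarize_pathways(abundance, pathway_defs)
print(result)
```

Here pathway_A is called present because both of its families are detected, while pathway_B is called absent because only one of its three families has non-zero abundance.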

VII. Digital Normalization of Metagenomic Reads

This pipeline uses the DigiNorm algorithm to normalize metagenomic reads, substantially reducing the size without any significant impact on the assemblies that will be generated.

This walkthrough uses a sample dataset from the HMP Illumina WGS Reads - Sample SRS018671.
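The core idea of digital normalization can be sketched in a few lines: a read is kept only if the median abundance of its k-mers, counted over the reads kept so far, is still below a cutoff. The version below is a toy under stated assumptions: it uses an exact in-memory counter and tiny values of `k` and `cutoff` chosen for the example, whereas practical implementations typically use much larger values and memory-efficient approximate counting. It is not the code run by the pipeline.

```python
from collections import Counter
from statistics import median


def digital_normalization(reads, k=4, cutoff=2):
    """Keep a read only if the median count of its k-mers, among reads kept
    so far, is below `cutoff`; k-mers of kept reads are then counted."""
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:  # read shorter than k: nothing to estimate from
            continue
        if median(counts[km] for km in kmers) < cutoff:
            kept.append(read)
            counts.update(kmers)
    return kept


# Hypothetical reads: one sequence sampled five times, another sampled once.
reads = ["ACGTACGT"] * 5 + ["TTTTGGGG"]
kept = digital_normalization(reads)
print(len(reads), "reads in,", len(kept), "reads kept")
```

Redundant copies of the highly covered sequence are discarded once its k-mers reach the cutoff, while the rarely seen sequence is retained, which is how the dataset shrinks without losing low-coverage content.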

VIII. Gene Clustering

This pipeline takes gene predictions from metagenomic shotgun sequence data (assemblies or reads), and generates a non-redundant gene set, using USEARCH (Edgar, 2010).
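One common way to build a non-redundant set is greedy centroid clustering: sequences are processed longest first, and each one either joins an existing representative it closely resembles or becomes a new representative. The sketch below illustrates that general approach only. It is not USEARCH: the similarity score from Python's `difflib` is a stand-in for a real alignment-based identity, and the threshold, function names, and sequences are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity score in [0, 1]; not an alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def nonredundant_set(genes, threshold=0.9):
    """Greedy centroid clustering: return one representative per cluster."""
    representatives = []
    # Longest first (ties broken by name), so each cluster is represented
    # by its longest member.
    for name, seq in sorted(genes.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        if all(similarity(seq, genes[rep]) < threshold for rep in representatives):
            representatives.append(name)
    return representatives


# Hypothetical gene predictions.
genes = {
    "gene1": "ATGAAACCCGGGTTTTAA",
    "gene2": "ATGAAACCCGGGTTTTAA",  # exact duplicate of gene1
    "gene3": "ATGAAACCCGGGTTTTAG",  # one mismatch relative to gene1
    "gene4": "ATGCGTCGTCGTCGTTGA",  # unrelated sequence
}
reps = nonredundant_set(genes)
print(reps)
```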
