Current News
May 2013
Tools & Technology
In addition to data generation, the HMP is invested in development of new tools & technologies for computational analysis. Here we provide information on funded technology development grants, and access to tools utilized by members of the HMP consortium. More information can be found in the menus above and on the NIH Common Fund Site.
Tools
All software, online resources and standard operating protocols used in, or developed as part of the HMP, will be accessible here as they become available.
If you have a protocol or software package that you would like to post on this site, or would like more information on the currently available content, please contact us via the feedback form.
Microbial Reference Genomes
| Downloadable Tools | |
| Core Gene Evaluation Script Screening for core gene sets as an indicator of completeness of draft genomes. This download includes a Perl script and the required archaeal and bacterial core gene FASTA and cluster files. | |
| Online Resources | |
| IMG System A community resource for comparative analysis and annotation of publicly available genomes in a uniquely integrated context |
| Pathogen Portal A set of web-based resources provided by the Bioinformatics Resource Centers (BRCs), focusing on organisms considered potential agents of biowarfare or bioterrorism or causing emerging or re-emerging diseases |
| RAST Annotation Server A fully-automated service for annotating bacterial and archaeal genomes, leveraging data and procedures established within the SEED framework to provide high quality gene calling and functional annotation |
Sampling, Sequencing, & Analyses of 16S rRNA
| Downloadable Tools |
| DNAclust DNAclust is a fast clustering algorithm specifically designed for high-stringency clustering of DNA sequences, e.g. for 16S rRNA analyses or removal of duplicates/near duplicates in high-throughput shotgun datasets. |
| GINKGO A GUI software package designed for non-statisticians to perform multivariate analysis |
| InVUE A toolkit for rapid development of custom software packages for visualization and analysis of large datasets |
| LEfSe LDA Effect Size is an algorithm for high-dimensional biomarker discovery and explanation that identifies metagenomic features (genes, pathways, or taxa) characterizing the differences between two or more biological conditions. It can be applied to taxonomic or functional abundance tables derived from metagenomic (WGS) data or 16S OTU/phylotype data. |
| Metastats Metastats is a statistical package for comparing metagenomic data-sets. Metastats was specifically designed for comparing clinical data comprising two treatment populations (e.g. sick vs. healthy), each comprising multiple samples; however, the software will also work for a small number of samples. Metastats identifies features of the samples that "explain" the difference between the treatment populations. The features can be OTUs (e.g. inferred from 16S data), taxonomic groups, or other groupings (genes, functional groups, etc.) for which count data are available. Metastats primarily relies on a non-parametric t-test and reverts to Fisher's exact test for sparse features. Additional tests (presence/absence, odds ratios, etc.) are currently being implemented. Metastats is available as a web service, as standalone R and C code, and as part of the Mothur package. |
| MicrobiomeUtilities A set of software utilities for processing and analyzing 16S rRNA genes including generating NAST alignments, chimera checking, and assembling paired 16S rRNA reads according to reference sequence homology |
| Mothur A platform-independent software package for describing and comparing microbial communities. Mothur incorporates the functionality of a number of computational tools, calculators & visualization tools into a single program |
| Qiime 'Quantitative Insight Into Microbial Ecology'. Qiime allows a range of community analyses suitable for microbiome data using traditional and high-throughput sequencing methods |
| R-package: Hypothesis Testing and Power Calculations for Comparing Metagenomic Samples from HMP This R-package provides several functions to perform formal hypothesis testing on the species abundance distribution of human microbiome data, and to calculate power and sample size requirements for human microbiome experiments. |
| R-package: Statistical Object Oriented Data Analysis of RDP-based Taxonomic trees from Human Microbiome Data: Modeling, Visualization, and Two-Group Comparison This R-package introduces Object Oriented Data Analysis (OODA) methods to analyze Human Microbiome taxonomic trees directly, providing tools to model, compare, and visualize populations of taxonomic tree objects. |
| Simrank A rapid and sensitive general-purpose k-mer search tool |
| speciateIT A package for speciation of 16S sequences |
| Unifrac A suite of tools for the comparison of microbial communities using phylogenetic information. It takes as input a single phylogenetic tree that contains sequences derived from at least two different environmental samples and a file describing which sequences came from which sample |
| Online Resources |
| Fast-Unifrac Provides a suite of tools for the comparison of microbial communities using phylogenetic information |
| Greengenes A 16S rRNA gene database and workbench compatible with ARB |
| RDP Provides ribosome related data and services to the scientific community, including online data analysis and aligned and annotated Bacterial and Archaeal small-subunit 16S rRNA sequences |
| SitePainter SitePainter allows users to visualize the different HMP body sites based on gradients of colors to represent available datasets |
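The Unifrac entry above describes two inputs: a phylogenetic tree and a mapping of sequences to samples. The following is a minimal Python sketch of how an unweighted UniFrac-style distance can be computed from those two inputs. It is not the Unifrac or Fast-Unifrac implementation. The tree encoding (a dictionary from node to parent and branch length), the function name, and the toy sequence and sample names are all assumptions made for illustration.

```python
def unweighted_unifrac(parent, leaf_samples, sample_a, sample_b):
    """Fraction of branch length that leads to sequences from only one sample.

    parent maps node -> (parent_node, branch_length); the root has no entry.
    leaf_samples maps each sequence (leaf) to the sample it came from.
    """
    if sample_a == sample_b:
        return 0.0
    below = {}  # branch (named by its child node) -> samples found beneath it
    for leaf, sample in leaf_samples.items():
        if sample not in (sample_a, sample_b):
            continue
        node = leaf
        while node in parent:  # walk from the leaf up to the root
            below.setdefault(node, set()).add(sample)
            node = parent[node][0]
    unique = sum(parent[n][1] for n, s in below.items() if len(s) == 1)
    total = sum(parent[n][1] for n in below)
    return unique / total if total else 0.0


# Hypothetical toy tree: root has children X and Y, each with two sequences.
tree = {
    "X": ("root", 1.0), "Y": ("root", 1.0),
    "seq1": ("X", 0.5), "seq2": ("X", 0.5),
    "seq3": ("Y", 0.5), "seq4": ("Y", 0.5),
}
# Hypothetical mapping of sequences to samples A and B.
mapping = {"seq1": "A", "seq2": "B", "seq3": "A", "seq4": "A"}
distance = unweighted_unifrac(tree, mapping, "A", "B")
print(distance)
```

In this toy case only the branch leading to X is shared by both samples, so most of the branch length counts as unique to one sample.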
Sampling, Sequencing & Analysis of Whole Metagenomic Sequence
| Downloadable Tools |
| BMTagger NCBI's Best Match Tagger for removing human reads from metagenomics datasets. All HMP metagenomic sequence submitted to NCBI's Sequence Read Archive is being human filtered using BMTagger. |
| DeconSeq Automatically detects and efficiently removes any type of sequence contamination from metagenomic datasets, including human or other host sequences. The tool uses a modified version of the BWA-SW aligner and can be applied to longer-read datasets (150+bp read length). DeconSeq is available as both standalone and web-based versions. |
| DNAclust DNAclust is a fast clustering algorithm specifically designed for high-stringency clustering of DNA sequences, e.g. for 16S rRNA analyses or removal of duplicates/near duplicates in high-throughput shotgun datasets. |
| FragGeneScan A short read gene finder |
| GINKGO A GUI software package designed for non-statisticians to perform multivariate analysis |
| HUMAnN The HMP Unified Metabolic Analysis Network (HUMAnN) is a pipeline for efficiently and accurately determining the presence/absence and abundance of microbial pathways in a community from metagenomic data (WGS). The pipeline converts sequence reads into coverage and abundance tables summarizing the gene families and pathways in one or more microbial communities. |
| InVUE A toolkit for rapid development of custom software packages for visualization and analysis of large datasets |
| LEfSe LDA Effect Size is an algorithm for high-dimensional biomarker discovery and explanation that identifies metagenomic features (genes, pathways, or taxa) characterizing the differences between two or more biological conditions. It can be applied to taxonomic or functional abundance tables derived from metagenomic (WGS) data or 16S OTU/phylotype data. |
| Metamos MetAmos is a pipeline for metagenomic assembly. It includes a collection of utilities for performing the assembly and for analyzing assembly output. |
| MetaPhlAn A computational tool for profiling the composition of microbial communities from metagenomic data (WGS). MetaPhlAn relies on unique clade-specific marker genes identified from reference genomes, allowing very fast computational times, unambiguous taxonomic assignments, and species-level resolution. |
| Metaphyler Metaphyler is a software tool for inferring the taxonomic composition of a microbial community from whole-metagenome (WGS) sequencing data. Metaphyler relies on alignments to a curated database of housekeeping genes. |
| Metapath Metapath is a statistical package for comparing metagenomic data-sets at the pathway level (using KEGG pathway information). Metapath relies on a graph-theoretic definition of statistical significance in order to identify pathway motifs that differ between samples from two treatment populations. |
| METAREP An open source tool to help scientists to view, query, browse, and compare metagenomics annotation data derived from ORFs called on metagenomics reads or assemblies (also available as an Online Resource) |
| Metastats Metastats is a statistical package for comparing metagenomic data-sets. Metastats was specifically designed for comparing clinical data comprising two treatment populations (e.g. sick vs. healthy), each comprising multiple samples; however, the software will also work for a small number of samples. Metastats identifies features of the samples that "explain" the difference between the treatment populations. The features can be OTUs (e.g. inferred from 16S data), taxonomic groups, or other groupings (genes, functional groups, etc.) for which count data are available. Metastats primarily relies on a non-parametric t-test and reverts to Fisher's exact test for sparse features. Additional tests (presence/absence, odds ratios, etc.) are currently being implemented. Metastats is available as a web service, as standalone R and C code, and as part of the Mothur package. |
| PRINSEQ A sequence processing tool that can be used to filter, reformat and trim genomic and metagenomic sequence data. It generates summary statistics of the input in graphical and tabular formats that can be used for quality control steps. PRINSEQ is available as both standalone and web-based versions. |
| Simrank A rapid and sensitive general-purpose k-mer search tool |
| TagCleaner Automatically detects and efficiently removes tag sequences (e.g. WTA or MID tags) from metagenomic datasets. TagCleaner is available as both standalone and web-based versions. |
| Online Resources |
| Biocyc A collection of Pathway/Genome Databases (PGDBs). Each PGDB describes the genome and metabolic pathways of a single organism. The MetaCyc database was used for HMP metabolic reconstruction. |
| IMG/M Provides tools for analyzing the functional capability of microbial communities based on their metagenome sequence, in the context of reference isolate genomes included from the Integrated Microbial Genomes (IMG) system |
| METAREP A suite of web based tools to help scientists to view, query, browse, and compare metagenomics annotation data derived from ORFs called on metagenomics reads or assemblies (also available as a stand alone tool) |
| MG-RAST A fully-automated service for annotating metagenome samples, providing annotation of sequence fragments, phylogenetic classification, metabolic reconstructions and comparison tools |
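The Metastats entry above names two kinds of test: a non-parametric t-test and Fisher's exact test for sparse features. The sketch below illustrates both ideas in plain Python, using a label-permutation t-test and a two-sided Fisher's exact test on a 2x2 table. It is not the Metastats code. The function names, the number of permutations, and the example abundances are assumptions, and the rule Metastats uses to decide when a feature is sparse is not reproduced here.

```python
import random
from math import comb, sqrt
from statistics import mean, variance


def t_statistic(x, y):
    """Welch two-sample t statistic (0.0 if both groups have zero variance)."""
    se = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se if se > 0 else 0.0


def permutation_t_test(x, y, n_perm=2000, seed=0):
    """Two-sided p-value estimated by shuffling the group labels."""
    rng = random.Random(seed)
    observed = abs(t_statistic(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(t_statistic(pooled[:len(x)], pooled[len(x):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(k):  # hypergeometric probability of k in the top-left cell
        return comb(row1, k) * comb(row2, col1 - k) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(k) for k in range(lo, hi + 1) if prob(k) <= p_obs * (1 + 1e-9))


# Hypothetical relative abundances of one feature in two populations.
sick = [0.30, 0.35, 0.28, 0.40, 0.33]
healthy = [0.10, 0.12, 0.08, 0.15, 0.11]
p_perm = permutation_t_test(sick, healthy)

# Hypothetical presence/absence counts for a sparse feature.
p_fisher = fisher_exact_two_sided(3, 0, 0, 3)
print(p_perm, p_fisher)
```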
Protocols
Microbial Reference Genomes
| Reference Genomes Database |
| HMP single cell MDA 16S rRNA Sanger sequencing SOP |
| Strain selection guidelines Guidelines for Reference Genome Strain selection |
| BEI contamination protocol |
| HMP Sequencing Center-specific Annotation Protocols The initial set of 178 Bacterial Reference Genomes described in the 2010 publication, A Catalog of Reference Genomes from the Human Microbiome, were annotated using individual sequencing center methodologies: |
| Consensus Annotation Protocols Subsequent Reference Genomes have been annotated using a consensus protocol for gene calling & functional annotation: |
| Provisional Reference Genome Assembly Metrics A set of quality control metrics run on every HMP Reference Genome to ensure accuracy, completeness and continuity of draft and improved assemblies |
| Bacterial Core Gene Evaluation Protocol describing use of the Core Gene Evaluation Script to assess completeness of bacterial draft assemblies |
| Archaeal Core Gene Evaluation Protocol describing use of the Core Gene Evaluation Script to assess completeness of archaeal draft assemblies |
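The idea behind the core gene evaluation protocols, using the presence of core genes as an indicator of draft genome completeness, can be illustrated with a minimal Python sketch. It assumes the core genes present in a draft assembly have already been identified by some search step that is not shown, and it only reports the fraction of the expected core set that was recovered. This is not the Perl script distributed by the HMP; the function name and gene identifiers are hypothetical.

```python
def core_gene_completeness(expected_core, found_in_draft):
    """Return (fraction of core genes recovered, sorted list of missing genes)."""
    expected = set(expected_core)
    if not expected:
        raise ValueError("expected_core must not be empty")
    found = expected & set(found_in_draft)  # ignore hits outside the core set
    missing = sorted(expected - found)
    return len(found) / len(expected), missing


# Hypothetical identifiers, for illustration only.
core = ["geneA", "geneB", "geneC", "geneD"]
hits = ["geneA", "geneC", "geneD", "geneX"]
fraction, missing = core_gene_completeness(core, hits)
print(f"{fraction:.0%} of core genes found; missing: {missing}")
```

A low fraction would suggest that the draft assembly is missing part of the genome, while hits outside the core set do not affect the score.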
Sampling, Sequencing, & Analyses of 16S rRNA
| Manual of Procedures (MOP) A reference document for current National Institutes of Health (NIH) policies and procedures as they apply to the Human Microbiome Project (HMP) Core Microbiome Sampling study. MOP Updates: Please download the MOP Supplement PDF for updates to product information and links. Study participant consent forms can be found on the Microbiome Analysis page, under the Sample Collection tab. |
| Core Microbiome Sampling Protocol |
| 16S Data Flow for HMP Sequencing Centers Guidelines for the HMP sequencing Centers for submitting 16S rRNA gene data and metadata to the HMP DACC |
| HMP 16S 454 protocol |
| Human Sequence Removal |
| SFF and Library Metadata File Generation |
| 16S rRNA mothur Curation Pipeline |
| QIIME Community Profiling SOP |
Sampling, Sequencing & Analysis of Whole Metagenomic Sequence
| Manual of Procedures (MOP) A reference document for current National Institutes of Health (NIH) policies and procedures as they apply to the Human Microbiome Project (HMP) Core Microbiome Sampling study. MOP Updates: Please download the MOP Supplement PDF for updates to product information and links. Study participant consent forms can be found on the Microbiome Analysis page, under the Sample Collection tab. |
| Core Microbiome Sampling Protocol |
| Human Sequence Removal |
| HMP WGS Read Processing |
| HMP Whole-Metagenome Assembly |
| Body Site Assembly |
| Metagenomics Annotation SOP |
| GO Slim Analysis |
| Functional Database SOP |
| HUMAnN SOP |
| HMP Hybrid Assembly |
Other Analysis
Walkthroughs
Walkthroughs are step-by-step tutorials taking users through typical HMP analysis paths, complete with sample datasets, detailed steps, screenshots and example output. They are geared toward educating researchers, particularly those without extensive bioinformatics infrastructure or experience, on using selected tools and resources to reproduce HMP analyses, with HMP-generated or personal data as input.
Initial HMP walkthroughs utilize CloVR, a desktop application integrating state-of-the-art genomic tools in a robust, user friendly, fully automated software package with optional support for cloud computing platforms. CloVR is distributed as a portable virtual machine launched on a desktop or laptop under VMware or Virtualbox.
If you have questions about current walkthroughs, or would like to suggest additional walkthroughs, provide feedback or participate in beta testing of future walkthroughs, please contact us via the feedback form.
I. HMP DACC 16S CloVR walkthrough
CloVR-16S supports 16S ribosomal RNA sequence analysis to study microbial community compositions. It processes short and long sequence reads from Sanger as well as Roche/454 sequencing, including sequence reads generated with the multiplex amplicon 454 pyrosequencing protocol with specifically tagged or barcoded 16S rRNA PCR primers. The CloVR-16S pipeline employs several well-known phylogenetic tools and protocols:
- QIIME - a Python-based workflow package, allowing for sequence processing and phylogenetic analysis using different methods including the phylogenetic distance metric UniFrac, UCLUST, PyNAST and the RDP Bayesian classifier;
- UCHIME - a tool for rapid identification of chimeric 16S sequence fragments;
- Mothur - a C++-based software package for 16S analysis;
- Metastats and custom R scripts used to generate additional statistical and graphical evaluations.
This walkthrough uses HMP 16S rRNA sequences representing communities extracted from 12 hard-palate and 12 attached-keratinized gingiva oral sites.
II. HMP DACC Metagenomics CloVR walkthrough
The CloVR-Metagenomics protocol supports the analysis of shotgun sequencing data from total metagenomic DNA sequencing projects. This pipeline utilizes a number of well-known tools for analysis of metagenomic data:
- UCLUST first clusters redundant sequences that show 99% nucleotide identity and removes artificial 454 replicate reads.
- Representative DNA sequences are searched against the NCBI COG database using BLASTX.
- Representative DNA sequences are searched against the NCBI RefSeq database of finished prokaryotic genomes using BLASTN.
- Metastats and CloVR-implemented R scripts are applied for additional statistical and graphical evaluations of the pipeline results.
- CloVR-Metagenomics generates several output reports including taxonomic and functional abundance tables, statistical comparisons of feature abundances between user-defined populations, and heatmaps with unsupervised clusterings of all samples.
This walkthrough uses HMP WGS reads representing microbial communities extracted from the mid-vagina and vaginal introitus sites.
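The first step of the pipeline, clustering redundant reads at 99% nucleotide identity and keeping one representative per cluster, can be sketched in Python as a simple greedy procedure. This is only an illustration of the idea, not UCLUST itself. The sketch assumes equal-length reads compared position by position, which real reads and real clustering tools do not require, and the function names and toy reads are hypothetical.

```python
def identity(a, b):
    """Fraction of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        return 0.0  # simplification: reads of different length never match
    return sum(x == y for x, y in zip(a, b)) / len(a) if a else 1.0


def greedy_cluster(reads, threshold=0.99):
    """Assign each read to the first representative it matches at >= threshold."""
    clusters = {}  # representative sequence -> list of member reads
    for read in reads:
        for rep in clusters:
            if identity(read, rep) >= threshold:
                clusters[rep].append(read)
                break
        else:
            clusters[read] = [read]  # no match: the read starts a new cluster
    return clusters


# Hypothetical toy reads: the second differs from the first at one of 100 bases.
reads = ["A" * 100, "A" * 99 + "C", "G" * 100]
clusters = greedy_cluster(reads)
representatives = list(clusters)
print(len(reads), "reads ->", len(representatives), "representatives")
```

Only the representatives would then be passed to the downstream BLASTX and BLASTN searches, which reduces the number of sequences to search.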
III. Human Contaminant Screening
This pipeline uses the NCBI BMTagger (Best Match Tagger) tool to identify and remove human reads in metagenomic sequences. For this walkthrough, we use a mock dataset which consists of a 50:50 mix of human contaminant-screened reads from an HMP project and filtered human reads from a 1000 genomes project.
IV. Metagenomic Assembly
This pipeline is used to generate a "Pretty Good Assembly": a reasonable attempt at reconstructing pieces of the organisms present in the community that are long enough to allow gene finding and other downstream analyses. This version of the pipeline uses SOAPdenovo v.1.04. The HMP Whole-Metagenome Assembly protocol provides a detailed description of the pipeline. For this walkthrough, we use a sample from the HMP Anterior Nares body site.
V. Alignment of Metagenomic Reads to Reference Genomes Using Bowtie
This walkthrough provides a simple example of how to set up and run the Bowtie aligner using the web-browser-accessible CloVR dashboard, and how to analyze the resulting outputs. We align metagenomic WGS reads extracted from the Anterior Nares body site (sample SRS019215) to the reference genome Staphylococcus aureus.
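One simple way to analyze alignment output is to count how many reads aligned to the reference. The Python sketch below assumes the alignments are available in SAM format, which may or may not be what the CloVR dashboard produces, and uses the standard SAM flag bit 0x4 to recognize unmapped reads. The function name, read names, and reference name are hypothetical.

```python
def mapped_fraction(sam_lines):
    """Return (mapped, total) read counts from an iterable of SAM lines."""
    mapped = total = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):  # skip blank and header lines
            continue
        flag = int(line.split("\t")[1])  # second SAM column is the bitwise FLAG
        if flag & 0x900:  # skip secondary and supplementary records
            continue
        total += 1
        if not flag & 0x4:  # bit 0x4 set means the read is unmapped
            mapped += 1
    return mapped, total


# Hypothetical SAM records, for illustration only.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t0\tref\t100\t255\t50M\t*\t0\t0\t*\t*",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
    "read3\t16\tref\t200\t255\t50M\t*\t0\t0\t*\t*",
]
mapped, total = mapped_fraction(example)
print(f"{mapped} of {total} reads aligned")
```

For a real file, the same function can be given an open file handle, since it only iterates over lines.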
Funded Tools & Technology Research
The HMP roadmap initiative calls for the development of new tools & technologies, informatics capabilities and resources needed for the advancement of the field of metagenomics. The data sets produced by metagenomic sequencing and related components will be very large and complex, requiring novel analytical tools for distilling useful information from vast amounts of sequence data, functional genomic data and subject metadata.
In addition, whole genome sequencing technologies are currently limited to the relatively small class of microbes that can be cultured. To maximize the number of sequences available in the reference set, new techniques must be developed to culture, or otherwise isolate for analysis, organisms that are currently unculturable. In the long term, methods for sequencing individual microbes or otherwise analyzing all of the members of complex populations will substantively advance this field.
HMP funded projects are presented here. More information can be found by clicking on each project. As these technologies are further described and new tools become public, we will make information available on the DACC. Additional details are available on the NIH Common Fund Site.
New Technologies
New Tools
Species-by-Species Dissection of Microbiomes using Phage Display and Flow Sorting
Metagenomics is a new scientific discipline that has developed in the last several years. It is both a set of research techniques, comprising many related approaches and methods, and a research field. As a scientific field, metagenomics attempts to resolve four tiers of questions:
1) what micro-organisms are present in a particular complex microbiome, such as human gut?
2) in what proportions?
3) what are they doing? and
4) how will they react to environmental changes, such as a change in diet?
Currently, the approach to answering these questions has been one of brute-force shotgun sequencing and 16S sequence surveys. However, these technologies can only hint at which kinds of bacteria are present, and the information provided tends to be biased toward the commonest species. The majority of bacteria in complex microbiomes cannot presently be cultured and sequenced in a conventional way. Whole genome amplification (WGA) from single cells has been used in several studies. However, in the best cases, only 60% of a genome can be covered with the DNA obtained with WGA from a single cell. Studies have shown that the bias from WGA is random and that coverage can be improved by adding more copies of the same genome. We propose here to change the metagenomics paradigm: rather than extracting all DNA in bulk without any independent information on the species that comprise it, we propose to develop tools to analyze species one by one.
This will be carried out by using phage display to select antibodies that recognize species in the population, and then to use such selected antibodies to characterize the abundance of the species by flow cytometry, purify it, and if necessary deplete the population of the species in order to repeat the process. The purified bacteria will be used as starting material for whole genome amplification, species characterization by rRNA analysis, and sequencing, if necessary. The antibodies developed within this proposal will be used to carry out the analyses indicated. Those developed within the context of the analysis of the human gut microbiome will also be very useful within the context of clinical studies in which bacterial composition may play an etiological role. An artificial bacterial mixture of E. coli and several other bacterial species will be used at the first stage of method development, and the microbiota in human gut will be analyzed in the later portion of the project.
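The abstract notes that WGA from a single cell covers at best about 60% of a genome, that the bias is random, and that coverage improves when more copies of the same genome are added. A small Python calculation shows why this matters. It rests on an assumption that goes beyond the abstract: each cell covers a random 60% of positions independently of the others, so the expected covered fraction from n cells is 1 - (1 - 0.6)^n. The function name is hypothetical.

```python
def expected_coverage(per_cell=0.60, cells=1):
    """Expected genome fraction covered by at least one of `cells` amplifications.

    Assumes each amplification covers a random `per_cell` fraction of positions,
    independently of the others (an idealization, not a measured result).
    """
    return 1.0 - (1.0 - per_cell) ** cells


for n in (1, 2, 3, 5):
    print(n, "cells:", round(expected_coverage(cells=n), 3))
```

Under this idealized model, pooling a handful of purified cells of the same species would cover most of the genome, which is consistent with the motivation for purifying a species before amplification.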
Targeted genomic characterization of uncultured bacteria from the human microbiota
A major goal of the Human Microbiome Project is to identify all of the organisms that are associated with the human body (the human microbiota) and determine the genomic sequence of most if not all of them. The detected diversity of the human microbiota reaches thousands of species and strains, the vast majority of which have not been isolated in pure culture. Our goal is to develop a robust and rapid approach for the targeted genomic characterization of any uncultured constituent of the human microbiota at single cell level and also to allow population genetic studies of selected groups of organisms that may have some cultured isolates. Our strategy utilizes the high phylogenetic resolution that the small subunit ribosomal RNA (SSU rRNA) provides in distinguishing microbial phylotypes. We plan to label and isolate single cells representing uncultured microbial lineages as well as populations of cells of specific phylotypes from complex microbiota samples and amplify their DNA to levels that enable genomic sequencing. This approach will bridge the gap between sequencing the limited number of individual cultured organisms and whole community shotgun sequencing (metagenomics) which generally does not provide sufficient depth and resolution to comprehensively sequence the microbiome. Initial feasibility studies indicate that our approach can be applied to any microbial consortia and is not dependent on the abundance of the target organism. Based on this, the focus of this proposal is to determine optimum experimental design and improved technical procedures for targeted single cell and population genomics of microbes from the human microbiota. The specific aims are to:
Aim 1. Separate single cells and populations of target uncultured microbial phylotypes from gut microbiota samples. We will use fluorescence in situ hybridization (FISH) combined with flow cytometry to obtain single cells and populations of targeted phylotypes, uncultured or with few cultured representatives.
Aim 2. Amplification and sequencing of genomes from single cells representing the uncultured gut microbiota. We will amplify the genomes of target cells using multiple displacement amplification and sequence the DNA to obtain draft genomic assemblies. The experimental and computational approaches will be optimized for the human microbiota characteristics.
Aim 3. Pangenomic characterization of targeted populations of uncultured and cultured microbial phylotypes. We will isolate populations of specific bacterial phylotypes representing uncultured organisms as well as cell populations representing species/genera that have representatives in culture and one or few genomes sequenced. We will amplify and sequence the cell population genomic DNA to obtain composite genomes/pangenomes.
FISH 'N' Chips: A Microfluidic Processor for Isolating and Analyzing Microbes
Since uncultivable microorganisms comprise a large percentage of the microbiome, and are likely to play a major role in the ecology at all sites within the body, it is critical to develop new approaches to obtain samples of these microorganisms for genomic analysis. In this proposal we focus on one anatomic site, the mouth, and propose to develop a technology to extract single bacterial cells from saliva. To attain these goals, we have formed a collaboration between Sandia (with expertise in integrated microfluidic technology for biological analysis), NYU College of Dentistry (with expertise in oral microbiomics and oral-based diagnostics), and the Joint Genome Institute (with expertise in microbial ecology and sequencing). The technological approach is to build an integrated microfluidic cell processor that will identify, select, and isolate into discrete microdroplets single bacteria from a mixture of oral bacteria from human saliva. The microfluidic processor will have multiple modules to 1) perform fluorescence in situ hybridization on a mixture of bacteria, 2) sort single cells using fluorescence activated photonic-force deflection, and 3) encapsulate sorted cells in microdroplets before depositing them on an array. The input to the device will be bacterial cells from saliva and the output will be arrayed droplets containing no more than one bacterium. We will first characterize and validate this processor using a mixture of pure bacterial cultures. Subsequently we will take salivary samples, deplete them of abundant bacterial species, and isolate individual cells, using specific 16S probes. Metagenomic analysis on the entire population of bacteria in saliva will be used to identify new bacterial sequences. With new sequence information, we will design 16S probes to isolate previously uncharacterized organisms for genomic testing. Isolated cells will be characterized as cultivable or non-cultivable, and known (sequenced) or unknown. 
Ultimately, this technique will be used to extract sequence-quality genomic DNA from individual microorganisms and can be used as a diagnostic to identify bacterial signatures obtained from healthy versus diseased subjects.
Functional Sorting of Microbial Cells From Complex Microbiota
Microbial communities play a significant role in maintaining human health. However, understanding the complex relationship between a human host and its resident microbial flora presents a considerable challenge. For example, the total number of microbial cells residing in an individual is estimated to far outnumber the individual's somatic cells. Unfortunately, the identity, distribution, and functional significance of the majority of these microorganisms are unknown. The situation is further complicated by the inability to culture many of these organisms in the laboratory. Here, we propose the development of a microfabricated device that enables the parallel culturing and characterization of individual members of a microbial community. Single cells will first be encapsulated into alginate gel microdroplets to allow for small-scale growth of thousands of isolated cells in parallel. Segregation of a cell into a gel bead will also facilitate subsequent sorting and selection based on the metabolic profile of the cell, which will be assessed using fluorogenic enzyme substrates. By employing a panel of different substrates, a large number of different species can be distinguished based on their metabolic properties. This approach will allow for quantification of the relative distribution and functional capabilities of the different members of the consortia and for subsequent genetic analyses. The scale, throughput capabilities, and sensitivity of the proposed technology address the key challenges facing the analysis of microbial consortia. Demonstration of this "front-end" sample preparation technique will greatly facilitate subsequent genome sequencing and interpretation of the complex relationship between a human host and its resident microbial flora.
Multi-Dimensional Separation of Bacteria
Efforts to understand the complex relationship between microbes and their hosts are complicated by the large number of nonculturable organisms and by the heterogeneity even within each species. While modern metagenomics approaches admirably sample the identity of microbes, bulk studies limit the other inferences that can be derived from each bacterium. In order to overcome this problem, yet retain clues to the diversity of the original population, we propose to model, design, build, and test a multidimensional microfluidic sorter based on both structural (size, shape, and electrophoresis) and functional (adhesion, chemotaxis) parameters to separate a complex bacterial mixture into bins containing bacteria that share common properties. The identities of the sorted bacteria will be obtained through metagenomic studies. The device will enable determination of the heterogeneity both between and within species in a complex mixture. This microfluidic separation device will utilize:
1. Asymmetric pinched flow fractionation to separate bacteria based on size and shape.
2. Electrophoretic based flow fractionation to separate bacteria based on surface charge.
3. Functionalized magnetic beads to separate bacteria based on adhesion to extracellular matrix (ECM) components.
4. Chemotaxis to separate bacteria based on their motile response to chemical stimuli, and lastly,
5. Multi separation modalities to separate bacteria based on size, shape, adhesion, response to chemical stimuli, and surface charge.
We will use microfluidic approaches since (i) the feature sizes of microfluidic systems are compatible with the size of the bacteria; (ii) complicated flow paths can be machined with ease and at low cost; (iii) many sorting modules utilizing diverse principles can be integrated into a single device. Within each specific aim we rely heavily on direct numerical simulation of particle movement using code that reflects 2-dimensional geometry. As part of the experimental plan, we will expand the functionality of our custom Particle Mover program to a full 3-D simulation. Once design parameters have been established, devices will be fabricated and tested rigorously using particles, mixtures of known bacteria, and, for the 3- and 4-stage devices, complex mixtures from human subjects. We will make use of a modular architecture that facilitates interchangeability of modules. The long-term goal is to add other separation modalities to the device and to integrate modules for single cell isolation and on-chip DNA isolation and amplification, to permit high-throughput analysis of complex mixtures. These studies will lead to devices that not only capture the diversity of complex mixtures, but also permit direct assignment of the heterogeneity of structural and functional properties, genes and gene products within each single species in the mixture, and aid understanding of human disease.
An Integrated lab-on-chip system for genome sequencing of single microbial cells
There are approximately 100 trillion microorganisms inhabiting the human body. The intimate interactions between these microorganisms and the host have a profound impact on the physiology of the human body. The majority of these microorganisms have yet to be fully characterized, mainly due to the difficulties in growing them in laboratory conditions. Here we propose the development of an efficient and scalable method to obtain genome sequences from single microbial cells. This will eliminate the need to obtain pure laboratory cultures, thus allowing us to systematically characterize the genome structure of microbial communities that reside in different parts of the human body. The proposed project contains three major components.
1. Development of low-cost and disposable cell sorting devices that integrate microfluidics with micro-scale optical components. Such a lab-on-chip cell sorter will be able to identify and isolate single microbial cells from samples that contain hundreds to thousands of different species and are often contaminated with free DNA from the host and other sources.
2. Development of a micro-well based polymerase cloning device for simultaneous amplification of genomes from hundreds to thousands of single microbial cells in parallel. Using such a device, we will prepare sufficient amounts of DNA from single cells for whole genome shotgun sequencing, a critical step to obtain genome sequences from single cells.
3. Development of an integrated pipeline for dissecting the genome composition of the human microbiome at single cell resolution. We will test this pipeline by isolating and sequencing genomes from approximately 35 single microbial cells at different levels of relative abundance from the mouse distal intestine. The proposed method will provide the research community with a new tool to identify unknown microbial species, to study their metabolic functions, and to better understand the host-microbe interactions under various physiological and disease conditions.
SCODA DNA extraction to normalize species representation
To perform a metagenomic analysis of the Human Microbiome, normalization of species abundances prior to library construction is required, both to allow cost-effective sequencing by avoiding redundancy of overrepresented organisms and to allow the detection and sequencing of very low abundance species. As a result, there exists a need for a robust DNA purification method that can efficiently extract DNA while rejecting contaminants, that can length select during the extraction process, and that can enrich DNA pools for low abundance sequences to ensure that as many organisms as possible can be detected. We aim to apply our SCODA technology for concentrating and purifying nucleic acids to perform integrated extraction and fragment length selection in a single automated step, in order to reduce the time and labor required to produce clone libraries from contaminated samples. In addition, we aim to develop sequence-specific concentration of nucleic acids as a means to enrich DNA populations for very low abundance species, and for targeted recovery of low abundance genomes.
Optimization of a microfluidic device for single bacterial cell genomics
The quest to characterize the human microbiome is a daunting goal, but one that promises to enhance significantly our understanding of health and our management of a wide variety of disease states. In this quest, two features of the human microbiota in particular pose major challenges: the large proportion and number of as-yet uncultivated species, and the extreme unevenness of the microbial communities, with a resulting large number of potentially important community members that fail to be "seen" in routine surveys. The ability to identify, isolate, and sequence the genome of single bacterial cells would allow us to characterize and understand both rare and uncultivated microbial species, and materially advance our understanding of the human microbiome. In recent work, a microfluidic device has been designed and fabricated, with features that mimic an integrated electrical circuit; this device isolates individual bacterial cells, and allows their genome to be amplified in nanoliter volumes. In this Application, a plan is proposed for optimization and augmentation of this microfluidics device, so that environmental contamination is reduced, rare cell types are more easily captured, larger numbers of cells are screened more quickly, and gene expression is more easily measured from single cells. The long-term objectives of this work are to enhance our understanding of the human microbial communities, and in particular, of novel or poorly-characterized, uncultivated microbial community members. This proposal responds to critical unmet needs posed by the NIH Human Microbiome Project. The following are the Specific Aims of this proposal:
Aim 1. To reduce the contribution of environmental DNA to single cell genomic sequence data, and increase the "signal-to-noise" ratio of the sequence data obtained with our cell-sorting, genome amplification microfluidics device. The experimental approach involves the integration of optical (laser) tweezers into the device.
Aim 2. To improve the ability to detect and capture rare microbial community members with the microfluidics device. The experimental approach involves the integration of fluorescence in situ hybridization techniques, specific probes, and fluorescence imaging with the microfluidics device.
Aim 3. To increase the speed of single cell selection and isolation with the microfluidics device. The experimental approach involves more highly parallel microdevice designs, optimization of laser power and laser optical path, and further automation of cell manipulations.
Aim 4. To enhance the capability for gene expression analysis in single bacterial cells. The experimental approach involves the development of on-chip protocols for RNA isolation, reverse transcription, and use of digital PCR to quantify transcript abundance from single cells.
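As an illustration of how digital PCR turns a count of positive partitions into a transcript estimate, the sketch below applies the standard Poisson correction, lambda = -ln(1 - p), where p is the fraction of positive partitions. It is a minimal example, not the project's protocol: the function names and the partition counts are hypothetical.

```python
import math


def copies_per_partition(positive: int, total: int) -> float:
    """Poisson-corrected mean number of target copies per partition.

    A partition is positive if it holds at least one copy, so the
    fraction of negatives is exp(-lambda) and lambda = -ln(1 - p).
    """
    if total <= 0 or not 0 <= positive < total:
        # positive == total would give an unbounded estimate
        raise ValueError("need total > 0 and 0 <= positive < total")
    fraction_positive = positive / total
    return -math.log(1.0 - fraction_positive)


def estimated_total_copies(positive: int, total: int) -> float:
    """Estimated number of target copies across all partitions."""
    return copies_per_partition(positive, total) * total


if __name__ == "__main__":
    # Hypothetical run: 500 of 1000 partitions are positive.
    print(round(estimated_total_copies(500, 1000), 1))
```

The correction matters most at high occupancy, where many positive partitions hold more than one copy and a raw count of positives would underestimate abundance.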
Cultivation and Characterization of Microaerobes from the Human Microbiome
Complex communities of microbes are intimately associated with all plants and animals in nature: they influence the evolution, physiology and ecology of the host. The specific roles of microbes in these symbiotic relationships have been best elucidated for that subset of microbes grown in pure culture. However, the application of cultivation-independent molecular surveys reveals that many of these microbes have yet to be cultivated. We propose to cultivate a physiological subset of novel microbes from the human microbiome - the microaerobes - by incorporating unique approaches to isolation and cultivation. We are focusing on microaerobes because oxygen diffusing into the GI tract from host tissue creates a microoxic zone adjacent to the tissue that is likely to be colonized by microaerobes. As a result of the proximity to the host tissue, these microbes are likely to interact directly with the host and so are key to understanding the role of the microbiome in human health and disease. Microbes are typically isolated under an atmosphere of 21% oxygen or strictly anoxic conditions. While these conditions are suitable for the cultivation of many organisms, microaerobes thrive under reduced concentrations of oxygen. They have specialized respiratory enzymes to harvest oxygen at low concentrations, and as a result occupy niches not available to typical aerobes. Microaerobes, including populations of Helicobacter and Campylobacter, occupy the GI tracts of many animals. We propose to extend the availability of cultured microaerobes from the human microbiome as follows:
1) Exploit microoxic atmospheres and novel cultivation strategies to isolate microaerobes from the mucosa of the human GI tract,
2) Select representative microaerobes based on their distribution and abundance in the GI tract and their estimated prevalence in the human population,
3) Provide a physiological characterization of representative isolates that are sent for genome sequencing.
Pure cultures of microaerophiles will not only enable direct sequencing of their genomes, but will be a valuable resource for genetic, biochemical and physiological experiments to test hypotheses generated from genome analyses. An improved understanding of the physiological ecology of these microbes and their impact on humans is most accessible when they are studied in pure and defined mixed cultures.
Technologies for the discovery of novel human colonic mucosal-associated microbes
There are many challenges to the study of the human enteric microbiome, among them solving practical issues such as obtaining undistorted and representative samples and determining how advanced technologies for discovery of uncharacterized, seemingly uncultivable, and poorly represented microorganisms can be applied to small sample sizes. However, studies to date have also failed to recognize that colonic lavage used to prepare the colon for standard colonoscopy significantly dilutes and distorts the enteric microbiome. Moreover, our knowledge of the human enteric microbiome is heavily based on analyses of stool and luminal samples which may not be sufficiently representative of the more residential and geographically-specific communities of mucosal-associated bacteria. In this regard, novel and underrepresented species that have special conditional or communal properties that facilitate their close proximity to the host are likely to be missed. These organisms are likely to have direct bearing on human health and disease. Regional differences in their composition and community organization must also be factored in when studying the human enteric microbiome. In this proposal, these issues will be taken into consideration in developing new and improved non-cultivation-based technologies that will ultimately facilitate genomic sequencing and metagenomic analysis of substantial numbers of previously uncharacterized members of the human enteric microbiome. We propose to obtain region-specific samples of mucosal-associated microbes in their natural state within the unprepped human colon. We will then develop and refine two non-cultivation-based approaches aimed at obtaining high grade, composite DNA of microbial communities or enriched/purified samples of underrepresented, unclassified microbial species.
The first involves laser capture microdissection of mucosal-associated bacteria from different regions of the human colon, which will be developed primarily for generating high quality DNA for metagenomic analyses. The second approach involves fluorescence in situ hybridization (FISH) using 16S rDNA and metagenomically-determined unique riboprobes coupled with fluorescence-activated cell sorting (FACS). Yield, enrichment, and purity will be optimized to discover and isolate novel, unclassified, and rare microbes from the human colonic microbiome for whole genome sequencing by high throughput sequencing centers. We believe these studies will produce non-cultivation-based technologies that will advance genomic sequencing and metagenomic analysis of substantial numbers of previously uncharacterized members of the human enteric microbiome.
Novel cultivation methods for the domestication of vaginal bacteria
The microbiota of the human vagina can profoundly affect the health of women and neonates. For instance, women with the condition bacterial vaginosis (BV) have increased risks of acquiring sexually transmitted infections such as HIV, and pregnant women with BV have increased risk of preterm birth. Our understanding of BV is hampered by the failure to cultivate many of the bacteria associated with this condition. PCR methods have demonstrated that novel and uncultivated bacterial species are common in women with BV and some uncultivated bacteria are associated with important adverse outcomes such as antibiotic failure and ascending infection. Microbial genome sequencing efforts hold the promise of providing new insights regarding the metabolic interactions among vaginal bacteria that help sustain these vaginal communities and the pathogenic capabilities of key species that mediate poor health outcomes. Conventional cultivation methods may fail to propagate human-associated bacteria for a variety of reasons. For example, growth in monoculture may be precluded when bacterial species grow together as metabolic (syntrophic) partners, exchanging critical nutrients. Likewise, one species may depend on a second species for a signaling compound that stimulates cell division, again precluding monoculture. Alternatively, conventional laboratory medium may fail to replicate the local environment found in a body surface or cavity. We seek to overcome these limitations using several different novel cultivation technologies and approaches, which are all based on the principle that cultivation of vaginal bacteria can best be achieved by better reproducing the natural vaginal microenvironment.
In Aim 1 we will use a novel miniaturized diffusion chamber device to isolate and propagate bacteria in microliter scale chambers. The in situ isolation chip (iChip) will allow bacteria to grow in pure culture while bathed in fluid from the natural environment. In this case, we will use vaginal lavage fluid for in vitro propagation of bacteria, and the human vagina itself for short term in vivo propagation of bacteria. The iChip serves as an intermediate for adapting bacteria for independent growth on laboratory media, a process that we call domestication.
In Aim 2, we will use co-cultivation methods to propagate fastidious vaginal bacteria in pure culture. This approach uses two growth chambers that are separated by a membrane permeable to chemicals but not bacteria. A known cultivated bacterium or bacterial community is inoculated into the lower chamber, and an uncultivated bacterial cell is inoculated in the upper chamber. Bacteria growing in the lower chamber are allowed to produce nutrients and growth factors to stimulate the proliferation of fastidious bacteria in the upper chamber. This strategy takes advantage of nutrient cycling and cross species signaling between bacterial species to allow isolation of fastidious microbes.
In Aim 3, we will use conventional media supplemented with sterile filtered vaginal lavage fluid to cultivate fastidious vaginal bacteria under diverse conditions with prolonged incubation. These media will contain missing nutrients and signaling molecules from the vaginal environment to facilitate isolation. Some novel vaginal bacteria are capable of laboratory propagation but require extended growth times and very specific growth conditions. Furthermore, some bacteria can be cultivated in the lab but are not easily identified using phenotypic methods. We will use 16S rRNA gene PCR to facilitate identification of all bacteria in this study.
PUBLIC HEALTH RELEVANCE: The vaginal microbiota has a major impact on the health of women and neonates. Bacterial vaginosis affects 29% of women in the United States and is associated with increased risk of sexually transmitted diseases, including HIV, preterm birth, pelvic inflammatory disease, and several other adverse outcomes. The microbiology of bacterial vaginosis is poorly understood and many bacterial species found in this condition have not been cultivated in the laboratory. This application seeks to cultivate numerous fastidious vaginal bacteria associated with bacterial vaginosis using several novel cultivation approaches with the goal of gaining new insights about the vaginal microbiome and its role in human health.
Confining single cells to enhance and target cultivation of human microbiome
Understanding the human microbiome is critical in maintaining human health and preventing disease, but it has been unclear how specific microbes affect health and disease because the majority of microbes cannot be cultivated using traditional methods. Technologies are needed that can both increase the success rate for cultivating microbes and target cultivation efforts towards microbes of high biomedical interest. This project will use microfluidic confinement to overcome the limitations of traditional cultivation and targeting methods by developing "single cell confinement technology". Stochastic confinement of single cells in droplets of small volumes (picoliters to nanoliters) will isolate microbial species and, potentially, enable cultivation of new microbes by initiating high-density growth from a single cell. The droplets created by this single cell confinement technology can be split to perform multiple assays in parallel on clonal sister populations, enabling killing assays to be performed to identify microbes in one sister population, and use of the other sister population for growth. In this technology, to target cultivation efforts towards microbes of biomedical interest, new species will be identified via two complementary approaches: gene-based assays and function-based assays. Gene-based assays, informed by existing metagenomic data, will identify desired functional genes and 16S sequences, and function-based assays will identify desired functions even if they are not associated with a known gene sequence. The identified microbial species will then be targeted for scale-up of microcolonies to make them available for sequencing and further study. We will develop and validate the single cell confinement technology by using sulfur-reducing bacteria from the human colon as the test system. Sulfur-reducing bacteria are of high biomedical importance, associated with ulcerative colitis and intra-abdominal infections, but are still poorly understood.
We will first use a model consortium of gut-derived microorganisms, containing a representative sulfur-reducing bacterium Bilophila wadsworthia, to develop and optimize the technology. Next, we will develop gene-based and function-based assays and test them by identifying sulfate-reducing bacteria in model mixtures. Finally, we will use these cultivation approaches and assays to cultivate and select new sulfur-reducing bacteria from the human colon. This technology will be generally applicable to identify and cultivate all classes of microbes in the human gut microbiome. This project will impact biomedical science and public health by developing and validating technologies for increasing our understanding of the relationship between genes and functions in the human gut microbiome, and therefore microbial contributions to both health and disease. PUBLIC HEALTH RELEVANCE: Microbes are critical to the function of the gastrointestinal tract. Understanding their role in human health and disease requires cultivation, but the majority of the species in the human gut microbiome are difficult to cultivate. This application will develop confinement-based technology to enhance cultivation of microbes from the human colon and target the cultivation efforts by using complementary assays to identify microbes with genes and functions of high biomedical interest.
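The stochastic confinement of single cells in droplets described above is commonly modeled with Poisson statistics: if cells are distributed at random, the number of cells per droplet follows a Poisson distribution whose mean is the cell concentration times the droplet volume. The sketch below works under that assumption only; the function names and the example mean of 0.1 cells per droplet are illustrative and not taken from the abstract.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of exactly k cells in a droplet under Poisson loading."""
    return mean ** k * math.exp(-mean) / math.factorial(k)


def loading_summary(mean_cells_per_droplet: float) -> dict:
    """Fractions of empty, single-cell and multi-cell droplets."""
    if mean_cells_per_droplet <= 0:
        raise ValueError("mean must be positive")
    empty = poisson_pmf(0, mean_cells_per_droplet)
    single = poisson_pmf(1, mean_cells_per_droplet)
    occupied = 1.0 - empty
    return {
        "empty": empty,
        "single": single,
        "multiple": 1.0 - empty - single,
        # share of occupied droplets that start from exactly one cell
        "single_among_occupied": single / occupied,
    }


if __name__ == "__main__":
    # Assumed dilution: on average 0.1 cells per droplet.
    for name, value in loading_summary(0.1).items():
        print(f"{name}: {value:.4f}")
```

Under this model, diluting the sample so that most droplets are empty is the price of making most occupied droplets clonal.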
Culturing uncultivatable gut microorganisms
The majority of gut microbes remain uncultivatable, and this significant obstacle must be overcome to understand the role of the microbiome in human health. The goal of this project is to develop a high-throughput method to grow previously uncultivatable bacteria. Our previous work with uncultivatable microorganisms from the external environment has led to a number of advances:
(1) it is possible to cultivate a substantial number of otherwise uncultivatable bacteria by growing them in situ. When microorganisms are placed into a diffusion chamber which is then returned to their natural environment, a substantial proportion of otherwise uncultivatable microorganisms will grow;
(2) reinoculation from chamber to chamber produces domesticated variants that can grow on synthetic media in vitro;
(3) many uncultivatable species will grow on synthetic media in co-culture with a cultivable organism from the same environment;
(4) we recently discovered the first growth promoting factors for uncultivatable bacteria. An assay-driven purification led to the identification of siderophores as essential factors produced by helper organisms that trigger growth of uncultivatable bacteria from marine sediment. We find that growth in co-culture can be used to obtain uncultivatable organisms from the gut flora. In this project, we will develop a high-throughput approach to co-culture in order to obtain a large collection of previously uncultivatable microorganisms from the gut microbiome. A panel of 24 cultivable gut species representing the main taxonomic groups will be arrayed in a microtiter plate and a platform carrying inserts with a 0.2 µm pore membrane will be placed in the wells. In this manner, each well will be separated into a bottom section inoculated with a given cultivable species and a top section connected with it through pores of the membrane. A suspension from a human fecal sample will then be separated by a cell sorter, and individual cells will be deposited in the upper chamber of each well. After incubation, material from both parts of a well will be collected and tested for growth of the two organisms separately and in co-culture. This will lead to the isolation of uncultivatable species and their helpers. 16S rRNA gene sequence determination will then identify these microorganisms. Whole genome sequencing will be performed for at least ten of the uncultivatable isolates from a variety of taxonomic groups. The genome sequencing will provide an ultimate validation of the proposed approach to obtain novel uncultivatable species from the microbiome. Growth factors will be isolated from the supernatant of corresponding helper organisms by bioassay-guided purification. Structures of the new compounds will be determined. The growth factors will then be examined, individually and in combination, for their ability to enable in vitro cultivation of uncultivatable microorganisms.
The tools and approaches we develop are likely to lead to the cultivation of many gut bacteria, and will help us understand the role of the gut microbiome in health and disease. PUBLIC HEALTH RELEVANCE: The majority of gut bacteria are uncultivatable, and do not grow under laboratory conditions for unknown reasons. We find that many of these organisms depend on neighboring, cultivable species for growth. In this project, we will develop a method for large-scale isolation of previously uncultivatable microorganisms by pairing them with the correct helper species, which will enable their genome sequencing, and detailed study of their role in health and disease.
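The co-culture screen described above can be sketched as two small bookkeeping steps: laying out the 24 helper species across wells, and classifying each sorted cell by whether it grows alone or only alongside a helper. This is a hypothetical illustration. The 4 x 6 plate format, the well labels, the helper names, and the category names are assumptions, since the abstract does not specify them.

```python
import string


def assign_helpers(helpers, rows=4, cols=6):
    """Map well labels (A1, A2, ...) to helper species, filling row by row."""
    if len(helpers) > rows * cols:
        raise ValueError("more helper species than wells")
    wells = [
        f"{string.ascii_uppercase[r]}{c + 1}"
        for r in range(rows)
        for c in range(cols)
    ]
    return dict(zip(wells, helpers))


def classify_isolate(grows_alone: bool, grows_with_helper: bool) -> str:
    """Interpret growth tests run separately and in co-culture."""
    if grows_alone:
        return "independent"        # no helper needed
    if grows_with_helper:
        return "helper-dependent"   # candidate uncultivatable species
    return "no growth"


if __name__ == "__main__":
    # Placeholder names standing in for the 24 cultivable gut species.
    layout = assign_helpers([f"helper_{i}" for i in range(1, 25)])
    print(layout["A1"], layout["D6"])
    print(classify_isolate(grows_alone=False, grows_with_helper=True))
```

Isolates classified as helper-dependent are the ones whose helper supernatants would be carried forward to growth factor purification.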
FACS-MABE: a method to sort and enrich the as-yet uncultured bacterial species from the human distal gut
Human health is intimately connected with the presence and activities of a wide range of microbial species that live on and within us (the microbiota). Characterization of these microbes will help us to understand how they influence human health. The Human Microbiome Project (HMP) aims to sequence the genomes from a large number of the thousands of bacterial species to be found within our microbiota for this purpose. When considering our gut microbiota, a major difficulty encountered is that the majority (~75%) of the many hundreds of bacterial species that reside there are as-yet uncultured, severely restricting the amount of research that can be done to characterize them fully in terms of their contributions to health. There is thus an urgent need to develop tools and methods to specifically access these as-yet uncultured species. Hypothesis. Within the human distal gut microbiota, approximately 5-10% of the species present account for ~99% of the total content. I hypothesize that many of the uncultured microbial species within the human gut are minority species that are numerically rare within the microbiota consortium, and that their rarity is controlled by growth constraints placed on them by the more dominant members of the consortium. I propose that manipulating this population by sorting the rare away from the more dominant species, will relieve this growth suppression, and that a combination of genome sequencing and semi-high throughput screening for optimal growth requirements will allow in vitro culture of these potentially medically-relevant minority species. Preliminary Studies. My laboratory has already provided the HMP with genomic DNA from over 100 bacterial isolates from the human gut, several of which are novel, previously uncultured species which were recovered simply by careful microbiological screening, paying close attention to the strict anaerobic environments and fastidious diets required by most of the gut microbiota species. 
We routinely use a specialized culture technique to model the gut bacterial community in vitro, and we have used spent culture media from these models to demonstrate the concept of in vitro growth suppression on culturable members of the community. Specific Aims. I propose to develop an innovative method to enrich for as-yet uncultured bacterial species from the human gut, using Fluorescence Activated Cell Sorting (FACS) combined with tailored magnetic antibodies. In addition, I propose a novel diffusion plate technique to rapidly screen for optimal growth conditions for recovered, enriched populations of as-yet uncultured organisms. Work Proposed. We will use fluorescent DNA probes to bind to molecular signatures of bacterial species within the gut microbiota, and FACS to sort the targeted strains. Recovered bacterial cells will be used to immunize mice, and the resulting antibodies will be magnetically labeled and used to enrich for live target bacterial species within the population. Enriched target populations will be subjected to genome sequencing and in vitro culture attempts using diffusion plates formulated with a wide range of potential growth substrates. PUBLIC HEALTH RELEVANCE: The bacterial community that resides within the human gut (the gut microbiota) is a highly complex consortium of hundreds of species that as a whole is poorly understood. Characterizing the bacterial species that comprise the gut microbiota is a current focus for research, as it is expected that this microbial community plays a pivotal role in human health. Since the majority of bacterial species within the gut microbiota have not yet been cultured in vitro, they remain largely unstudied. The ultimate goal of this research proposal is to develop techniques to enrich for and culture the as-yet uncultured bacterial species of the human gut microbiota, to render them accessible to detailed, health-driven research.
Isolation, selection, and polony amplification of single cells in a gel matrix
The human microbiome represents a largely undefined consortium of organisms that may play a role in human health and disease. New technologies are needed to isolate and sequence individual "reference" genomes for a better understanding of the complex microbial ecology of the human host. Specific Aims:
1) To test a mechanism for isolating and amplifying polymerase colonies (polonies) from the whole genome of single cells using solid phase PCR in a polyacrylamide hydrogel; and
2) to explore UV-photocatalysis as a method of selectively weakening microbial cell walls, thereby rendering the genome accessible to DNA polymerase.
Research Design: A whole genome amplification method that is sensitive to one genome equivalent (1-5 fg) of DNA was developed in this laboratory, based on tagged random hexamer PCR (T-PCR). The innovation comes from a modified primer design that stabilizes the polymerase and the primer-template complex. For this proposal, the two-step approach of genome tagging and amplification will be converted into two compatible solid-phase PCR reactions using thin layers of porous polyacrylamide hydrogel. In order to amplify whole genomes from single cells, standard microbiological plating techniques will be used to spatially isolate Escherichia coli cells onto the surface of the hydrogel and then sandwich the cells between the two reaction layers. The whole genome amplification method will be optimized for generating "sequencing-ready" DNA from individual isolated cells, with fragment size controlled by PCR extension time. Polonies generated by the T-PCR method will be recovered from the hydrogel and characterized by high throughput 454 Pyrosequencing to determine genome sequence coverage and any potential amplification biases. For the second phase of research, a complex microbial sample from the oral-salivary microbiome will be evaluated using solid-phase polony amplification. The potential for a diversity of different cell types requires an added step of cell lysis/selection, and UV photocatalysis will be explored as a means to weaken the microbial cell wall and improve susceptibility to heat lysis during PCR. Polonies generated in this manner will be screened by sequencing the 16S rRNA gene to assess microbial diversity of recovered whole genomes. Implications: If successful, this technology will provide a simple and readily accessible approach for spatially isolating and selecting single cells for whole genome amplification. 
PUBLIC HEALTH RELEVANCE: We will develop a metagenomics approach for plating, selecting, and amplifying whole genomic DNA from individual microbial cells in a hydrogel matrix. The ability to spatially isolate and amplify polymerase colonies (polonies) using solid-phase PCR will help expand the reference library of whole genome sequences from the human microbiome, leading to a better understanding of the microbial ecology of human health and disease.
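The "one genome equivalent (1-5 fg)" figure quoted in the research design can be checked with a short calculation: the mass of a double-stranded genome is roughly its length in base pairs times an average molar mass per base pair, divided by Avogadro's number. The sketch below assumes about 650 g/mol per base pair, a common approximation, and uses the roughly 4.64 Mbp genome of Escherichia coli K-12 as the example; neither value comes from the abstract.

```python
AVOGADRO = 6.02214076e23          # molecules per mole
AVG_BP_MASS_G_PER_MOL = 650.0     # assumed average for one double-stranded base pair


def genome_mass_fg(genome_size_bp: float) -> float:
    """Approximate mass of one genome copy in femtograms."""
    grams = genome_size_bp * AVG_BP_MASS_G_PER_MOL / AVOGADRO
    return grams * 1e15  # grams -> femtograms


if __name__ == "__main__":
    # E. coli K-12 genome length (about 4.64 million base pairs).
    print(f"{genome_mass_fg(4_641_652):.2f} fg")
    # A hypothetical small 1 Mbp genome for comparison.
    print(f"{genome_mass_fg(1_000_000):.2f} fg")
```

Under these assumptions, genomes of about 1 to 4.6 Mbp weigh roughly 1 to 5 fg per copy, which is consistent with the range stated in the abstract.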
Metagenomic dissection of the gut microbiota
The human gut microbiota makes important functional contributions to the host's metabolism and physiological traits, but remains largely unknown due to its complexity. To generate DNA materials of the microbial species from the human microbiota suitable for genomic sequencing, we propose to design, construct and test droplet-based microfluidic devices for the co-cultivation and analysis of various subsets of the total gut microbiota. We will pursue two aims: Specific Aim 1: Design and construct microfluidic components for cell encapsulation, co-cultivation, and genetic characterization. Specific Aim 2: Test the devices with a synthetic model system and gut microbial samples from gnotobiotic animals. PUBLIC HEALTH RELEVANCE: Microbial communities in the gastrointestinal tract have been found to make functional contributions to the host's metabolism and physiological traits, such as digestion and immunity. To facilitate the cultivation of these gut microbes and to eventually elucidate the underlying microbe-microbe and host-microbe interactions, we propose to design and build a prototype microfluidic platform to compartmentalize, co-cultivate and analyze various subsets of the total gut microbiota.
Tools for human microbiome studies
The complex and dynamic communities of microbes that are present on and within the human body (the human microbiota) are thought to profoundly influence human health in a variety of ways, through effects on human physiology, nutrition, immunity, and development. Studies on humans and vertebrate animal models have generated evidence that this is the case for some specific diseases and suggest that further studies in this area may be vital for the understanding, prevention, and treatment of many human diseases, as well as the maintenance of homeostasis. It is currently a challenge to even identify comprehensively the components of the human microbiota, although genomic approaches have greatly improved the feasibility of doing so. The study of the collective DNA (the human microbiome) of community members has been spurred by recent advances in DNA sequencing and other technologies; such technologies have created the new field of metagenomics (determining the DNA sequence of genomes from a mixed community of organisms). One issue associated with metagenomic studies is that analysis of mixed populations is extremely difficult due to the highly heterogeneous nature of the sample. To avoid this issue, efforts to isolate individual organisms from these mixed populations have been used. Unfortunately, obtaining sufficient quantities of pure, individual isolates is also a challenge. Growth of bulk cultures and purification of DNA for analysis is tedious and labor intensive, and many isolates are difficult, if not impossible, to culture. The utilization of whole-genome amplification methods on single-cell isolates to provide sufficient DNA for comprehensive downstream testing is a solution to this problem. To date, reliable whole-genome DNA amplification methods have failed to provide DNA suitable for analysis from trace samples due to reagent contamination, method sensitivity issues, and the generation of chimeric products that confuse analysis. 
In a collaborative effort between GE and the Broad Institute (a world leader in the implementation of new technologies to generate DNA sequence), the project will develop a methodology that high-throughput sequencing facilities can use to completely characterize individual microbes by DNA sequencing, furthering knowledge and stimulating work in this area. PUBLIC HEALTH RELEVANCE: There is a growing desire to understand more about the relationship human beings have with the microorganisms growing in and on their bodies (the human microbiome). DNA sequencing of the entire genome of each organism is one method being used to gather precise information about these microorganisms. The team plans to develop a whole-genome DNA amplification method with single-cell sensitivity that will enable high-throughput DNA sequencing of entire genomes to be performed from single cells, eliminating the requirement to purify and culture each isolate.
Algorithmically-Tuned Protein Families, Rule_Base and Characterized Proteins
Analysis of the microbial communities present in or on the human body holds promise for explaining the dynamic basis of host-microbiome symbiosis and the contribution of these communities (the human microbiome) to health and disease. Vast amounts of metagenomic DNA sequence can be collected. However, current bioinformatics tools limit our ability to translate sequence into fundamentally new biomedical knowledge. There is a great need to improve existing tools and develop computational methods to address the complexity of data generated by human microbiome projects (HMP). This proposal takes a three-pronged approach to dramatically improve methods for extracting meaning from HMP sequence data. The first is to develop algorithms that build protein families, each family just inclusive enough that checking a genome for some cohort of families tells whether or not a pathway is present. These algorithms resemble Phylogenetic Profiling, a data mining technique, but go through optimization steps that guide the building of each family. Pre-built families are not required. The result is new descriptive power that can discover and describe new systems and pathways. Thousands of new families will be created. The second is a new way to apply annotation rules. Large numbers of rules created automatically, each of which works on fairly small numbers of proteins, can apply very exacting tests to determine whether one protein should be expected to have the same function as another that is already characterized. By deriving support from comparing gene regions or metabolic backgrounds in ways made possible only by having large numbers of complete genomes, these rules can achieve much greater confidence than more simplistic annotation techniques. The third is a systematic compilation of the right starting points for annotation. 
Annotation methods today are built to achieve maximum leverage from those few proteins whose functions are known for sure, but searching for those good anchors is surprisingly difficult, and searching repeatedly wasteful. The CHAR database will collect experimentally characterized proteins and make them "rule-ready" and universally available. All of the resources developed through this proposal will be made publicly available. These approaches combine to let us read metabolic properties from microbial genome sequences more accurately, and figure out better ways to fight disease.
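The first prong above, comparing which protein families occur together across genomes and then checking a genome for a cohort of families, can be illustrated with a small Python sketch. It is not the proposal's algorithm: the Jaccard score, the 0.8 threshold, and the family and genome names (`famA`, `g1`, and so on) are assumptions made up for this example, and the optimization steps the proposal describes are not modeled.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of genome identifiers."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def co_occurring_pairs(profiles, threshold=0.8):
    """Return (family, family, score) for families with similar profiles.

    `profiles` maps a family name to the set of genomes that contain it.
    The threshold is an arbitrary illustrative value.
    """
    names = sorted(profiles)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = jaccard(profiles[first], profiles[second])
            if score >= threshold:
                pairs.append((first, second, score))
    return pairs


def pathway_present(genome_families, cohort):
    """A pathway is called present only if every family in its cohort is found."""
    return set(cohort) <= set(genome_families)


# Hypothetical toy data: three families observed across four genomes.
profiles = {
    "famA": {"g1", "g2", "g3"},
    "famB": {"g1", "g2", "g3"},
    "famC": {"g4"},
}
pairs = co_occurring_pairs(profiles)
```

In this toy data, `famA` and `famB` share an identical profile and are reported as a candidate pair, while `famC` is not.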
New Tools for Understanding the Composition and Dynamics of Microbial Communities
The microbes that inhabit human bodies outnumber the human cells by an order of magnitude, and impact many aspects of health and disease including obesity, vaginosis, and Crohn's disease. Understanding this endogenous microbiota is emerging as a key extension of efforts to understand the human genome and the role of genetic variation in health and disease. The Human Microbiome Project (HMP) will characterize microbial communities in a large number of individual healthy humans using metagenomic sequencing. Consequently, new methods for interpreting sequence data to understand microbial community composition and dynamics are urgently needed. This project unites disciplines ranging from ecology to evolutionary biology to applied mathematics, to develop new methods for understanding which body habitats are more or less similar in terms of their microbial communities, by evaluating measures of microbial diversity and change, and creating needed new metrics of community composition. This will enable understanding of how clinically relevant parameters such as age, sex, or the pH of specific body habitats affect these communities, and of the dynamics of change in microbial communities within an individual, in transmission between individuals, and in transmission between humans and the environment. This project is directly responsive to the Roadmap RFA for Development of New tools for Computational Analysis of Human Microbiome Project Data. The specific aims of this proposal are:
Aim 1. Develop, characterize, and apply enriched descriptors of microbial community diversity.
Aim 2. Develop methods for describing how human microbial communities vary over time and space.
Aim 3. Develop new methods for tracing the flow of organisms among different communities.
Some key aspects of the proposed work are: the development of new statistical methods for estimating microbial diversity within a body habitat; development of enriched methods for describing microbial community diversity; exhaustive validation of methods for comparing microbial communities through large-scale simulations and by using the largest available data sets that characterize microbial communities empirically; and the development of new methods for tracing the sources of the microbes that inhabit the human body using both marker genes and whole-metagenome data. Key outcomes include the ability to help determine the extent to which there is a core human microbiome, and how best to sample human microbial diversity. All methods developed will be made available under open source licenses and will be deposited with the HMP Data Analysis and Coordination Center (DACC). The investigators intend to work closely with other researchers involved in the HMP in order to ensure rapid progress.
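For readers unfamiliar with the kinds of measures this abstract refers to, the sketch below computes two standard ecological quantities in Python: the Shannon index for diversity within one sample and the Bray-Curtis dissimilarity between two samples. These are textbook metrics chosen for illustration only; the abstract does not state which metrics the project uses or develops, and the count vectors are invented.

```python
import math


def shannon_diversity(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over taxa with nonzero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors over the same taxa.

    0.0 means identical composition; 1.0 means no shared taxa.
    """
    if len(a) != len(b):
        raise ValueError("count vectors must cover the same taxa")
    denom = sum(a) + sum(b)
    if denom == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(a, b)) / denom


# Hypothetical taxon counts for two body habitats (same taxon order in both).
gut = [40, 30, 20, 10]
skin = [5, 5, 10, 80]
within_gut = shannon_diversity(gut)
between = bray_curtis(gut, skin)
```

The first function describes a single community; the second compares two, which is the distinction the abstract draws between diversity within a body habitat and similarity across habitats.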
Assembly and analysis software for exploring the human microbiome
Bacteria are the most abundant organisms on Earth, yet little is known about most members of this domain of life. Only about 1% of bacterial species can be easily grown in culture, and considerably fewer have been sequenced. Advances in sequencing technologies have made it possible to sequence bacteria directly from the environment, providing a dramatic new outlook on the diversity of bacteria populating our world. Initial studies have explored the bacteria present in mines, ocean water, and soil, as well as communities of commensal microbes that inhabit the human body. The latter have provided a glimpse at the complex symbiotic relationships between bacteria and their human hosts. Despite an increased interest in environmental sequencing (metagenomics), few specialized computational algorithms exist for the analysis of such data. For example, the assembly of environmental data is being performed with software originally intended for homogeneous DNA sources, such as clonal bacterial populations or inbred eukaryotes. These programs are ill-suited to the assembly of heterogeneous microbial communities and numerous "hacks" have been necessary to produce the assemblies published to date. This proposal aims to fill the need for specialized software for assembling and finding genes in metagenomic datasets. A particular focus will be on developing tools for uncovering genomic variation within the assemblies of microbial communities. The proposed software will specifically address issues arising from the use of new sequencing technologies in metagenomic projects. The low cost and high throughput of these technologies will allow a far deeper exploration of the microbial biosphere than was previously possible. Their broad application, however, depends on the availability of software systems adapted to their specific characteristics. 
In addition, new algorithms will be developed to allow the individual components of a metagenomic analysis pipeline to be tightly integrated, with the goal of improving the overall quality of both assembly and annotation, and to facilitate the extraction of other types of information from large sets of metagenomic data. The proposal further aims to investigate the impact of experimental design and choice of sequencing technology on the ability to assemble and analyze metagenomic data, through the development of software for simulating bacterial populations and emulating a variety of sequencing strategies. Better experimental design can reduce the high costs currently associated with environmental sequencing and enhance subsequent analyses. All software developed as part of this proposal, as well as any simulated data and results of reanalyzing public datasets, will be released freely through public databases and open-source software repositories.
Fragment assembly and metabolic/species diversity analysis for Human microbiome
The human microbiome contributes essential and complementary genetic and metabolic components to the host human. Until recently, microbiologists mainly studied individual culturable species of microbes, even though a vast majority (approximately 95%-98%) of microorganisms cannot live in pure culture. Facilitated by the rapid advancement of the DNA sequencing techniques, metagenomics attempts to directly determine the whole collection of genes within an environmental sample. To study the human microbiome at a global level, metagenomics becomes the methodology of choice for the Human Microbiome Project (HMP). We propose to develop computational methods addressing several challenges to the metagenomic analysis in HMP, namely, the assembly of short reads from pyrosequencing, the functional annotation of protein coding genes through database searching, and the characterization of the biodiversity in samples. We start with a novel approach to assembling short reads from metagenomics, called ORFome Assembly, by assembling putative ORFs from homologous proteins in the same family into a protein family graph (an Eulerian path approach). We then propose a network matching approach for the similarity search using the protein family graphs as queries. We anticipate that using protein family graphs will result in database searching with higher sensitivity and specificity than simply using unassembled sequencing reads. Finally, we propose to develop computational tools to simultaneously assess the biodiversity and biological functions in samples, by identifying the most likely set of coherent pathway variants covering the annotated gene functions within the metagenomic data based on the similarity search results. 
These software tools will enable researchers to efficiently and effectively analyze the data from HMP, which will enhance the understanding of the relationship between the human microbiota (i.e., the microbes living on the surface of and inside the human body) and human diseases, and hasten the development of better or new therapies.
High Performance Validation and Classification of Metagenomic Ribosomal-RNA Sequences
Innovations in culture-independent studies of environmental DNA sequences (i.e., metagenomics), coupled with rapidly advancing DNA sequencing capabilities, have profoundly altered the volume of sequence data that can be processed in a study. However, several bottlenecks to metagenomic data analysis must be overcome as production is scaled up and findings are generalized. These include detection and culling of human and chimeric sequences; removal/correction of sequencing errors; accurate assessment of biodiversity; accurate taxonomic classification of sequences; and analysis of microbial eukaryotes in metagenomic specimens. Our overall objective is to build a framework for evaluating and ensuring the quality of primary sequence data and associated phylogenetic metadata. Because rRNA-based phylogenetic analysis remains an essential means of organizing and interpreting the analyses of other metagenomic sequences, we focus in this proposed project on quality assurance issues related to rRNA sequence data. Specifically, we propose to build a software infrastructure based on a high-precision alignment tool (INFERNAL) that addresses many of the critical barriers to progress facing metagenomic research programs. Rigorous rRNA sequence alignment is a strict requirement for accurate sequence-based phylogenetic classification of microorganisms in metagenomic samples. The open-source INFERNAL alignment software developed by Prof. Sean Eddy (Co-Investigator) and colleagues permits a level of analysis that extends far beyond other widely-used automated sequence aligners. This base technology, developed to identify and annotate RNA genes in genomes in conjunction with the Rfam database, offers an opportunity to develop and incorporate features that could significantly reduce current barriers to metagenomic analysis. INFERNAL uses consensus RNA primary and secondary structure (a covariance model; CM) to guide alignment.
Calculation of position-specific measures of alignment uncertainty allows detection of poorly aligned sequences and alignment positions, which can be removed prior to downstream applications, for example phylogenetic inference. INFERNAL-based CM alignment can be used, therefore, as a sensitive mechanism for detecting and eliminating anomalous sequences (e.g., chimeras, non-rRNA sequences) and sequencing errors from datasets. In this two-year project, we propose a leveraged scheme in which the utility of the INFERNAL technology is adapted to the needs of the metagenomics community through joint development by the Pace and Eddy groups. In this proposal the Eddy lab (fully funded by HHMI) will continue to develop the core technology and functionality enhancements of INFERNAL, while the Pace lab (as funded by this grant) will use their extensive background in rRNA phylogenetic analyses to build and validate software tools that extend the basic feature set of INFERNAL, with special emphasis on facilitating research carried out in the Human Microbiome Project.
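The idea of using position-specific alignment uncertainty to remove poorly aligned sequences and columns can be sketched as follows. This Python example assumes the per-position confidence values (between 0 and 1) have already been obtained and are supplied as plain lists; it does not read INFERNAL output or call INFERNAL, and the function name, thresholds, and toy sequences are all hypothetical.

```python
def filter_alignment(alignment, confidence, min_seq_conf=0.8, min_col_conf=0.8):
    """Drop low-confidence sequences, then low-confidence columns.

    alignment:  dict of name -> aligned sequence (all the same length)
    confidence: dict of name -> list of per-position confidence values
    Both thresholds are illustrative assumptions, not recommended settings.
    """
    # Step 1: discard sequences whose mean confidence is too low
    # (for example chimeras or non-rRNA reads that align poorly throughout).
    kept = {
        name: seq
        for name, seq in alignment.items()
        if sum(confidence[name]) / len(confidence[name]) >= min_seq_conf
    }
    if not kept:
        return {}

    # Step 2: keep only columns whose mean confidence over the
    # remaining sequences meets the column threshold.
    width = len(next(iter(kept.values())))
    good_cols = [
        j
        for j in range(width)
        if sum(confidence[name][j] for name in kept) / len(kept) >= min_col_conf
    ]
    return {name: "".join(seq[j] for j in good_cols) for name, seq in kept.items()}


# Hypothetical toy alignment; "chim" stands in for an anomalous sequence.
alignment = {"s1": "ACGU", "s2": "ACGA", "chim": "AUUU"}
confidence = {
    "s1": [1.0, 1.0, 0.9, 0.4],
    "s2": [1.0, 0.9, 0.9, 0.5],
    "chim": [0.9, 0.2, 0.1, 0.2],
}
cleaned = filter_alignment(alignment, confidence)
```

Here the anomalous sequence is removed first, so it does not drag down the column averages used in the second step; the final column is then dropped because both remaining sequences align there with low confidence.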
Exploiting Microbiome Sequences for Improved Models of Protein-DNA Interactions
This project will develop computer programs to exploit the Human Microbiome Project (HMP) DNA sequences to better understand DNA-protein interactions. The interactions between transcription factors and the DNA sites that they bind to are critical to controlling the expression of the genes within each species, and therefore also the characteristics of each species and its interactions with the human host. The transcription factors themselves can be readily identified from DNA sequences and we will take advantage of the fact that most bacterial transcription factors regulate themselves and/or adjacent genes within their chromosomes. Transcription factors can be clustered into groups that are expected to recognize the same patterns of DNA, based on known structures for similar proteins from well studied bacteria. Together the clusters of proteins with very similar specificity and the probable regulatory regions of nearby promoters will give us a very large number of potential DNA-protein interacting sites on which to apply pattern discovery algorithms. This should not only help us to learn about the regulatory networks within the HMP species, but also lead to more general understanding about the relationships between transcription factor proteins and the DNA patterns that they recognize. This will have broader implications across several areas of biological research and may lead to the design of new proteins with novel specificities that could be useful as research tools and for therapeutics.
Novel Computational Tools for Studying the Human Microbiome
The Human Microbiome Project will generate billions of high throughput sequence reads from rRNA gene PCR products and metagenomic DNA; these data have the potential to revolutionize our understanding of the microbial inhabitants of humans, the putative functions of these microbes, and their associations with health and disease. However, limitations in our ability to process this flood of data hinder our ability to make inferences or draw conclusions. Specifically, commonly available methods for identifying microbes from DNA or RNA sequences do not identify organisms to the species level, and may fail to perform confident assignment to the genus level or higher despite sufficient phylogenetic information to do so. As a result, many publicly available classification tools lump sequences representing distinct species into less specific taxonomic categories, as we have found when applying these tools to several novel bacteria linked with vaginal disease. This proposal is significant because it offers solutions to these fundamental problems by developing and refining novel computational tools; prototypes of these tools have already demonstrated significantly improved results. Our freely available software will help catalyze research on the human microbiome by increasing the speed, accuracy, and specificity of microbial identification, as well as offering methods for between-sample comparison. There are several innovative features of this proposal. First, computationally efficient maximum-likelihood phylogenetic placement of sequences on trees will provide a robust method for identifying microbes and distinguishing between novelty and uncertainty. Second, this proposal will provide accurately annotated collections of reference sequences that can facilitate classification of organisms present in major human body sites. 
More importantly, this proposal will develop software tools that will enable individual researchers to assemble sets of reference sequences using an approach that maximizes sequence diversity within each represented taxon while excluding poor quality and mislabeled sequences. Third, this proposal will develop new analysis and visualization tools to aid statistical comparison of microbial communities across space and time, and help capture these complex changes in intuitive visualizations.
Aim 1: Develop and optimize phylogenetic placement software for the analysis of 16S rRNA and other phylogenetically informative loci to better describe bacterial diversity and community composition. This aim will advance the development of our phylogenetic placement software pplacer, including the addition of algorithms for taxonomic annotation and species delineation, implementation of improved measures of uncertainty, and low-level code optimization.
Aim 2: Develop computational tools to curate project-specific sets of reference sequences from public repositories and local sources. This aim is motivated by our observation that appropriately selected reference sequences and accurate phylogenies are a critical and limiting component of the classification process.
Aim 3: Develop a software pipeline to integrate high throughput sequencing data analysis, including preprocessing, phylogenetic placement, statistical comparison, and phylogenetic visualization. This aim will result in two deliverables extending the capabilities of a broad spectrum of researchers: a web service for users who value simplicity, as well as R / Bioconductor software packages for users who value modularity, reproducibility, and extensibility.
Novel Methods for Effective Analysis, Assembly and Comparison of HMP Sequences
The human microbiota is thought to have a profound influence on human health. The goal of the Human Microbiome Project (HMP) is to expand our understanding of the human microbiome by generating reference microbiome genomes, identifying "core" genomes, studying their variation related to human health, and developing new technologies and informatics tools. Huge amounts of sequence data have been generated in the HMP using metagenomics and next-generation sequencing technologies. It is becoming very challenging for existing resources and methods to manage and analyze the HMP data. The challenges are imposed not only by the huge volume but also by the great diversity and complexity of the sequence data. To address these challenges, we propose several new computational methods to rapidly and effectively analyze very large HMP datasets.
(1) Consensus-based meta-assembler and pre-assembly processing. The goal is to significantly improve the assembly of metagenomic sequences. Instead of developing another assembly program, we will build a meta-assembler on top of available assemblers. We will also develop a pre-assembly protocol to filter and handle highly redundant and problematic sequences.
(2) Fast fragment recruitment and large-scale clustering. We plan to develop a fast program to align raw metagenomic reads to reference or homologous genomes. It is intended to fill the gap between very fast but very stringent mapping programs (e.g. Bowtie), very slow but very sensitive aligning programs (e.g. BLAST), and fast but less sensitive ones (e.g. BLAT). We also plan to enable our clustering program CD-HIT to handle very large next-generation sequencing datasets.
(3) Dedicated utilities for annotation and comparison of metagenomes.
In recent years, we developed an HMM-based method for identification of rRNAs from raw reads, a fast method to identify artificial 454 duplicates, an automated workflow for metagenome annotation, a rapid and reliable reciprocal sequence comparing protocol, and a statistical method to compare many metagenomes with a unique visualization interface. We plan to improve these metagenomics-specific tools to achieve much better speed, performance and capability. The methods will be available as open source software, as web servers or both. We have obtained very promising preliminary results. The proposed tools will effectively help researchers in HMP data analysis. Other HMP related informatics tools in gene prediction, binning and assembly will greatly benefit from our proposed work.
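To make the notion of large-scale sequence clustering more concrete, the following Python sketch shows a greedy incremental scheme: sequences are processed longest first, and each one either joins the first cluster whose representative it resembles or starts a new cluster. This is only an illustration under stated assumptions and is not CD-HIT's implementation. In particular, the similarity function is a crude shared k-mer fraction standing in for a real sequence identity calculation, and the 0.9 threshold, k of 4, and toy reads are invented.

```python
def kmers(seq, k=4):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_similarity(a, b, k=4):
    """Fraction of the smaller k-mer set that is shared (a stand-in for identity)."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 0.0
    return len(ka & kb) / min(len(ka), len(kb))


def greedy_cluster(seqs, threshold=0.9, k=4):
    """Greedy incremental clustering; returns a list of (representative, members)."""
    clusters = []
    # Longest sequences first, so each representative is the longest member.
    for seq in sorted(seqs, key=len, reverse=True):
        for representative, members in clusters:
            if kmer_similarity(seq, representative, k) >= threshold:
                members.append(seq)
                break
        else:
            # No existing representative is similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


# Hypothetical toy reads: two near-identical sequences and one unrelated one.
reads = ["ACGTACGTACGTAA", "ACGTACGTACGT", "TTTTGGGGCCCCAA"]
clusters = greedy_cluster(reads)
```

Because each sequence is compared only against cluster representatives rather than against every other sequence, the number of comparisons stays far below an all-against-all comparison when most sequences fall into a modest number of clusters.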
Identifying Population-Level Variation In Cross-Sectional And Longitudinal HMP Studies
There is a fundamental gap in understanding the significance of how intra- and inter-personal variation in the structure of the human microbiome affects human phenotypes. Continued existence of this gap is problematic because it impedes the ability to relate this variation to changes in host health. Part of the problem is the over-reliance on microbiome-wide metrics of similarity instead of population-based metrics. This is similar to using microarray technology to compare the overall differences of E. coli gene expression in exponential versus stationary phase without addressing the change in expression of individual genes. Yet, a quantitative framework to aid in the analysis of population data from cross-sectional and longitudinal studies is lacking. The long-term goal is to understand the mechanisms that shape the structure and function of the human microbiome. The objective of this proposal is to develop robust computational tools that are optimized to analyze large sequence collections, yet are accessible to the typical investigator who is not an expert in bioinformatics. Specifically, this proposal will fulfill the stated need to develop computational tools that enable HMP-scientists to determine whether "variation in the microbiome at a site can be related to human phenotypes, such as disease." The rationale for this proposal is the imminent release of data from a number of HMP Demonstration Projects that are pursuing cross-sectional and longitudinal sampling, but have realized that they are limited in their ability to identify statistically robust linkages between specific changes in the microbiome and human phenotypes.
Building upon extensive previous experience and interactions with HMP investigators, the objective will be achieved by pursuing three specific aims:
1) implement and disseminate computational tools in the mothur software package;
2) develop tools to correlate inter-subject variation in the microbiome with variation in health; and
3) develop tools to connect the dynamics of the microbiome with changes in health.
Each of the tools developed in the proposed research will be validated using simulated data and evaluated using HMP-generated sequence data. This research is innovative because it builds upon an already strong collection of tools for describing a community's "parts list" within the popular mothur software package and will create a robust set of statistical tools for assessing temporal variation and how that variation is related to health. The proposed research is significant because it will improve our ability to advance the goals of the HMP by relating community and population-level dynamics to changes in human health.
Functional Activity and Inter-Organismal Interactions in the Human Microbiome
High-throughput sequencing has provided a tool capable of observing the human microbiome, but characterizing the biological roles and metabolic potential of these microbial communities remains a significant challenge. Increasing evidence points to the functional activity of gene products, rather than community taxonomic composition, as the most robust descriptor of the microflora's relationship with its host and as a potential point of intervention in modulating human health. Existing computational tools for exploring a newly sequenced metagenome rely heavily on sequence homology and do not yet leverage information from the thousands of publicly available functional experimental results. Likewise, no previous methods have provided genome-scale computational tools for biological hypothesis generation regarding specific molecular interactions among the microflora and with a human host. This proposal aims to develop computational methodology to interpret the functional activity of microfloral communities:
1. Integrate functional information from taxonomic, metagenomic, and metatranscriptomic datasets. We will develop methodology to unify these three representations of microbiome composition by incorporating information from large scale functional genomic data collections.
2. Identify genomic predictors of inter-species functional activity, including host/microflora interactions and points of community-wide regulatory feedback. We will computationally screen microbiome assays for molecular interactions and regulatory motifs spanning multiple organisms in the community.
3. Implement these technologies as publicly available, accessible, and interpretable tools.
We will provide freely available, open source, downloadable and web-based implementations of this methodology for use by the bioinformatic and biological communities. As high-throughput sequencing becomes more widely used to study microbial communities in the human microbiome and in the environment, computational tools will be necessary to summarize their global functional activity and systems-level regulatory interactions. In the long term, by providing methodology to understand the human microbiome at the molecular level, we hope to enable its future use as a diagnostic indicator and as a point of intervention to improve human health.
