Technology & Tool Development

The HMP roadmap initiative calls for the development of new tools & technologies, informatics capabilities and resources needed for the advancement of the field of metagenomics. The data sets produced by metagenomic sequencing and related components will be very large and complex, requiring novel analytical tools for distilling useful information from vast amounts of sequence data, functional genomic data and subject metadata.

As well, whole genome sequencing technologies are currently limited to the relatively small class of microbes that can be cultured. In order to maximize the number of sequences available in the reference set, new techniques must be developed to culture or otherwise isolate for analysis currently unculturable organisms. In the long-term, methods for sequencing individual microbes or otherwise analyzing all of the members of complex populations will substantively advance this field.

HMP funded projects are presented here. More information can be found by clicking on each project. As these technologies are further described and new tools become public, we will make them available via this site.

Additional Tools, Online Resources & Protocols being used by the HMP Consortium can be found on the Tools & Protocols page.


Project Title Principal Investigator(s) Institution(s)
Species-by-Species Dissection of Microbiomes using Phage Display and Flow Sorting Cliff Han, Andrew Bradbury Los Alamos National Laboratory

Metagenomics is a new scientific discipline that has developed in the last several years. It is both a set of research techniques, comprising many related approaches and methods, and a research field. As a scientific field, metagenomics attempts to resolve four tiers of questions: 1) what micro-organisms are present in a particular complex microbiome, such as the human gut? 2) in what proportions? 3) what are they doing? and 4) how will they react to environmental changes, such as a change in diet? Currently, the approach to answering these questions has been one of brute-force shotgun sequencing and 16S sequence surveys. However, these technologies can only hint at which kinds of bacteria are present, and the information provided tends to be biased toward the most common species. The majority of bacteria in complex microbiomes cannot presently be cultured and sequenced in a conventional way. Whole genome amplification (WGA) from single cells has been used in several studies. However, in the best cases, only 60% of a genome can be covered with the DNA obtained with WGA from a single cell. Studies have shown that the bias from WGA is random and that coverage can be improved by adding more copies of the same genome. We propose here to change the metagenomics paradigm: rather than extracting all DNA in bulk without any independent information on the species that comprise it, we propose to develop tools to analyze species one by one. This will be carried out by using phage display to select antibodies that recognize species in the population, and then using the selected antibodies to characterize the abundance of each species by flow cytometry, purify it, and if necessary deplete the population of that species in order to repeat the process. The purified bacteria will be used as starting material for whole genome amplification, species characterization by rRNA analysis, and sequencing, if necessary. 
The antibodies developed within this proposal will be used to carry out the analyses indicated. Those developed within the context of the analysis of the human gut microbiome will also be very useful within the context of clinical studies in which bacterial composition may play an etiological role. An artificial bacterial mixture of E. coli and several other bacterial species will be used at the first stage of method development, and the microbiota in human gut will be analyzed in the later portion of the project.
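The remark above that WGA bias is random and that coverage improves with additional copies of the same genome can be made concrete with a small calculation. The sketch below is only an illustration under a simplifying assumption that is not stated in the abstract: each amplified cell independently covers a uniformly random 60% of the genome. The function name and parameters are hypothetical.

```python
def expected_coverage(per_cell_coverage: float, n_cells: int) -> float:
    """Expected fraction of a genome covered after pooling n_cells amplifications.

    Illustrative assumption: every cell independently covers a uniformly
    random fraction `per_cell_coverage` of the genome, so a given position
    is missed by all n cells with probability (1 - p) ** n.
    """
    if not 0.0 <= per_cell_coverage <= 1.0:
        raise ValueError("per_cell_coverage must be between 0 and 1")
    if n_cells < 0:
        raise ValueError("n_cells must be non-negative")
    return 1.0 - (1.0 - per_cell_coverage) ** n_cells


if __name__ == "__main__":
    # 0.6 is the best-case single-cell coverage quoted in the abstract.
    for n in (1, 2, 3, 5):
        print(n, round(expected_coverage(0.6, n), 4))
```

Under this idealized model, pooling two cells of the same species would already raise the expected coverage from 60% to 84%. Real amplification bias is unlikely to be perfectly uniform, so actual gains may be smaller.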

Targeted genomic characterization of uncultured bacteria from the human microbiota Mircea Podar UT-Battelle, LLC - Oak Ridge National Laboratory

A major goal of the Human Microbiome Project is to identify all of the organisms that are associated with the human body (the human microbiota) and determine the genomic sequence of most if not all of them. The detected diversity of the human microbiota reaches thousands of species and strains, the vast majority of which have not been isolated in pure culture. Our goal is to develop a robust and rapid approach for the targeted genomic characterization of any uncultured constituent of the human microbiota at single cell level and also to allow population genetic studies of selected groups of organisms that may have some cultured isolates. Our strategy utilizes the high phylogenetic resolution that the small subunit ribosomal RNA (SSU rRNA) provides in distinguishing microbial phylotypes. We plan to label and isolate single cells representing uncultured microbial lineages as well as populations of cells of specific phylotypes from complex microbiota samples and amplify their DNA to levels that enable genomic sequencing. This approach will bridge the gap between sequencing the limited number of individual cultured organisms and whole community shotgun sequencing (metagenomics) which generally does not provide sufficient depth and resolution to comprehensively sequence the microbiome. Initial feasibility studies indicate that our approach can be applied to any microbial consortia and is not dependent on the abundance of the target organism. Based on this, the focus of this proposal is to determine optimum experimental design and improved technical procedures for targeted single cell and population genomics of microbes from the human microbiota. The specific aims are:
Aim 1. Separate single cells and populations of target uncultured microbial phylotypes from gut microbiota samples. We will use fluorescence in situ hybridization (FISH) combined with flow cytometry to obtain single cells and populations of targeted phylotypes, uncultured or with few cultured representatives.
Aim 2. Amplification and sequencing of genomes from single cells representing the uncultured gut microbiota. We will amplify the genomes of target cells using multiple displacement amplification and sequence the DNA to obtain draft genomic assemblies. The experimental and computational approaches will be optimized for the human microbiota characteristics.
Aim 3. Pangenomic characterization of targeted populations of uncultured and cultured microbial phylotypes. We will isolate populations of specific bacterial phylotypes representing uncultured organisms as well as cell populations representing species/genera that have representatives in culture and one or few genomes sequenced. We will amplify and sequence the cell population genomic DNA to obtain composite genomes/pangenomes.

FISH 'N' Chips: A Microfluidic Processor for Isolating and Analyzing Microbes Anup K Singh Sandia Corp-Sandia National Laboratories

Since uncultivable microorganisms comprise a large percentage of the microbiome, and are likely to play a major role in the ecology at all sites within the body, it is critical to develop new approaches to obtain samples of these microorganisms for genomic analysis. In this proposal we focus on one anatomic site, the mouth, and propose to develop a technology to extract single bacterial cells from saliva. To attain these goals, we have formed a collaboration between Sandia (with expertise in integrated microfluidic technology for biological analysis), NYU College of Dentistry (with expertise in oral microbiomics and oral-based diagnostics), and the Joint Genome Institute (with expertise in microbial ecology and sequencing). The technological approach is to build an integrated microfluidic cell processor that will identify, select, and isolate into discrete microdroplets single bacteria from a mixture of oral bacteria from human saliva. The microfluidic processor will have multiple modules to 1) perform fluorescence in situ hybridization on a mixture of bacteria, 2) sort single cells using fluorescence activated photonic-force deflection, and 3) encapsulate sorted cells in microdroplets before depositing them on an array. The input to the device will be bacterial cells from saliva and the output will be arrayed droplets containing no more than one bacterium. We will first characterize and validate this processor using a mixture of pure bacterial cultures. Subsequently we will take salivary samples, deplete them of abundant bacterial species, and isolate individual cells, using specific 16S probes. Metagenomic analysis on the entire population of bacteria in saliva will be used to identify new bacterial sequences. With new sequence information, we will design 16S probes to isolate previously uncharacterized organisms for genomic testing. Isolated cells will be characterized as cultivable or non-cultivable, and known (sequenced) or unknown. 
Ultimately, this technique will be used to extract sequence-quality genomic DNA from individual microorganisms and can be used as a diagnostic to identify bacterial signatures obtained from healthy versus diseased subjects.

Functional Sorting of Microbial Cells From Complex Microbiota Mitchel Doktycz UT-Battelle, LLC - Oak Ridge National Laboratory

Microbial communities play a significant role in maintaining human health. However, understanding the complex relationship between a human host and its resident microbial flora presents a considerable challenge. For example, the total number of microbial cells residing in an individual is estimated to far outnumber the individual's somatic cells. Unfortunately, the identity, distribution, and functional significance of the majority of these microorganisms are unknown. The situation is further complicated by the inability to culture many of these organisms in the laboratory. Here, we propose the development of a microfabricated device that enables the parallel culturing and characterization of individual members of a microbial community. Single cells will first be encapsulated into alginate gel microdroplets to allow for small-scale growth of thousands of isolated cells in parallel. Segregation of a cell into a gel bead will also facilitate subsequent sorting and selection based on the metabolic profile of the cell, which will be assessed using fluorogenic enzyme substrates. By employing a panel of different substrates, a large number of different species can be distinguished based on their metabolic properties. This approach will allow for quantification of the relative distribution and functional capabilities of the different members of the consortia and for subsequent genetic analyses. The scale, throughput capabilities, and sensitivity of the proposed technology address the key challenges facing the analysis of microbial consortia. Demonstration of this "front-end" sample preparation technique will greatly facilitate subsequent genome sequencing and interpretation of the complex relationship between a human host and its resident microbial flora.

Multi-Dimensional Separation of Bacteria G. Scott Worthen Children's Hospital of Philadelphia

Efforts to understand the complex relationship between microbes and their hosts are complicated by the large number of nonculturable organisms, and the heterogeneity even within each species. While modern metagenomics approaches admirably sample the identity of microbes, bulk studies limit the other inferences that can be derived from each bacterium. In order to overcome this problem, yet retain clues to the diversity of the original population, we propose to model, design, build, and test a multidimensional microfluidic sorter based on both structural (size, shape, and electrophoresis) and functional (adhesion, chemotaxis) parameters to separate a complex bacterial mixture into bins containing bacteria that share common properties. The identities of the sorted bacteria will be obtained through metagenomic studies. The device will enable determination of the heterogeneity both between and within species in a complex mixture. This microfluidic separation device will utilize 1. Asymmetric pinched flow fractionation to separate bacteria based on size and shape. 2. Electrophoretic based flow fractionation to separate bacteria based on surface charge. 3. Functionalized magnetic beads to separate bacteria based on adhesion to extracellular matrix (ECM) components. 4. Chemotaxis to separate bacteria based on their motile response to chemical stimuli, and lastly, 5. Multi separation modalities to separate bacteria based on size, shape, adhesion, response to chemical stimuli, and surface charge. We will use microfluidic approaches since (i) the feature sizes of microfluidic systems are compatible with the size of the bacteria; (ii) complicated flow paths can be machined with ease and at low cost; (iii) many sorting modules utilizing diverse principles can be integrated into a single device. Within each specific aim we rely heavily on direct numerical simulation of particle movement using code that reflects 2-dimensional geometry. 
As part of the experimental plan, we will expand the functionality of our custom Particle Mover program to a full 3-D simulation. Once design parameters have been established, devices will be fabricated and tested rigorously using particles, mixtures of known bacteria, and, for the 3- and 4-stage devices, complex mixtures from human subjects. We will make use of a modular architecture that facilitates interchangeability of modules. The long-term goal is to add other separation modalities to the device and to integrate modules for single cell isolation, DNA isolation and on-chip amplification, to permit high-throughput analysis of complex mixtures. These studies will lead to devices that not only capture the diversity of complex mixtures, but also permit direct assignment of the heterogeneity of structural and functional properties, genes and gene products within each single species in the mixture, and aid understanding of human disease.

An Integrated lab-on-chip system for genome sequencing of single microbial cells Yu-Hwa Lo, Kun Zhang University Of California San Diego

There are approximately 100 trillion microorganisms inhabiting the human body. The intimate interactions between these microorganisms and the host have a profound impact on the physiology of the human body. The majority of these microorganisms have yet to be fully characterized, mainly due to the difficulties in growing them under laboratory conditions. Here we propose the development of an efficient and scalable method to obtain genome sequences from single microbial cells. This will eliminate the need to obtain pure laboratory cultures, thus allowing us to systematically characterize the genome structure of microbial communities that reside in different parts of the human body. The proposed project contains three major components. 1. Development of low-cost, disposable cell sorting devices that integrate microfluidics with micro-scale optical components. Such a lab-on-chip cell sorter will be able to identify and isolate single microbial cells from samples that contain hundreds to thousands of different species and are often contaminated with free DNA from the host and other sources. 2. Development of a micro-well based polymerase cloning device for simultaneous amplification of genomes from hundreds to thousands of single microbial cells in parallel. Using such a device, we will prepare sufficient amounts of DNA from single cells for whole genome shotgun sequencing, a critical step in obtaining genome sequences from single cells. 3. Development of an integrated pipeline for dissecting the genome composition of the human microbiome at single-cell resolution. We will test this pipeline by isolating and sequencing genomes from approximately 35 single microbial cells at different levels of relative abundance from the mouse distal intestine. 
The proposed method will provide the research community with a new tool to identify unknown microbial species, to study their metabolic functions, and to better understand host-microbe interactions under various physiological and disease conditions.

SCODA DNA extraction to normalize species representation Andre Marziali Boreal Genomics Inc.

To perform a metagenomic analysis of the human microbiome, normalization of species abundances prior to library construction is required, both to allow cost-effective sequencing by avoiding redundant coverage of overrepresented organisms and to allow the detection and sequencing of very low abundance species. As a result, there exists a need for a robust DNA purification method that can efficiently extract DNA while rejecting contaminants, that can length-select during the extraction process, and that can enrich DNA pools for low abundance sequences to ensure that as many organisms as possible can be detected. We aim to apply our SCODA technology for concentrating and purifying nucleic acids to perform integrated extraction and fragment length selection in a single automated step, in order to reduce the time and labor required to produce clone libraries from contaminated samples. In addition, we aim to develop sequence-specific concentration of nucleic acids as a means to enrich DNA populations for very low abundance species, and for targeted recovery of low abundance genomes.

Optimization of a microfluidic device for single bacterial cell genomics David A Relman Stanford University

The quest to characterize the human microbiome is a daunting goal, but one that promises to enhance significantly our understanding of health and our management of a wide variety of disease states. In this quest, two features of the human microbiota in particular pose major challenges: the large proportion and number of as-yet uncultivated species, and the extreme unevenness of the microbial communities, with a resulting large number of potentially important community members that fail to be "seen" in routine surveys. The ability to identify, isolate, and sequence the genome of single bacterial cells would allow us to characterize and understand both rare and uncultivated microbial species, and materially advance our understanding of the human microbiome. In recent work, a microfluidic device has been designed and fabricated, with features that mimic an integrated electrical circuit; this device isolates individual bacterial cells, and allows their genome to be amplified in nanoliter volumes. In this application, a plan is proposed for optimization and augmentation of this microfluidics device, so that environmental contamination is reduced, rare cell types are more easily captured, larger numbers of cells are screened more quickly, and gene expression is more easily measured from single cells. The long-term objectives of this work are to enhance our understanding of the human microbial communities, and in particular, of novel or poorly-characterized, uncultivated microbial community members. This proposal responds to critical unmet needs posed by the NIH Human Microbiome Project. The following are the Specific Aims of this proposal:
Aim 1. To reduce the contribution of environmental DNA to single cell genomic sequence data, and increase the "signal-to-noise" ratio of the sequence data obtained with our cell-sorting, genome amplification microfluidics device. The experimental approach involves the integration of optical (laser) tweezers into the device.
Aim 2. To improve the ability to detect and capture rare microbial community members with the microfluidics device. The experimental approach involves the integration of fluorescence in situ hybridization techniques, specific probes, and fluorescence imaging with the microfluidics device.
Aim 3. To increase the speed of single cell selection and isolation with the microfluidics device. The experimental approach involves more highly parallel microdevice designs, optimization of laser power and laser optical path, and further automation of cell manipulations.
Aim 4. To enhance the capability for gene expression analysis in single bacterial cells. The experimental approach involves the development of on-chip protocols for RNA isolation, reverse transcription, and use of digital PCR to quantify transcript abundance from single cells.

Cultivation and Characterization of Microaerobes from the Human Microbiome Vincent B Young, Thomas Mitchell Schmidt Michigan State University

Complex communities of microbes are intimately associated with all plants and animals in nature: they influence the evolution, physiology and ecology of the host. The specific roles of microbes in these symbiotic relationships have been best elucidated for that subset of microbes grown in pure culture. However, the application of cultivation-independent molecular surveys reveals that many of these microbes have yet to be cultivated. We propose to cultivate a physiological subset of novel microbes from the human microbiome - the microaerobes - by incorporating unique approaches to isolation and cultivation. We are focusing on microaerobes because oxygen diffusing into the GI tract from host tissue creates a microoxic zone adjacent to the tissue that is likely to be colonized by microaerobes. As a result of the proximity to the host tissue, these microbes are likely to interact directly with the host and so are key to understanding the role of the microbiome in human health and disease. Microbes are typically isolated under an atmosphere of 21% oxygen or strictly anoxic conditions. While these conditions are suitable for the cultivation of many organisms, microaerobes thrive under reduced concentrations of oxygen. They have specialized respiratory enzymes to harvest oxygen at low concentrations, and as a result occupy niches not available to typical aerobes. Microaerobes, including populations of Helicobacter and Campylobacter, occupy the GI tracts of many animals. We propose to extend the availability of cultured microaerobes from the human microbiome as follows: 1) Exploit microoxic atmospheres and novel cultivation strategies to isolate microaerobes from the mucosa of the human GI tract, 2) Select representative microaerobes based on their distribution and abundance in the GI tract and their estimated prevalence in the human population, 3) Provide a physiological characterization of representative isolates that are sent for genome sequencing. 
Pure cultures of microaerophiles will not only enable direct sequencing of their genomes, but will be a valuable resource for genetic, biochemical and physiological experiments to test hypotheses generated from genome analyses. An improved understanding of the physiological ecology of these microbes and their impact on humans is most accessible when they are studied in pure and defined mixed cultures.

Technologies for the discovery of novel human colonic mucosal-associated microbes Eugene B Chang University Of Chicago

There are many challenges to the study of the human enteric microbiome, among them solving practical issues such as obtaining undistorted and representative samples and determining how advanced technologies for discovery of uncharacterized, seemingly uncultivable, and poorly represented microorganisms can be applied to small sample sizes. However, studies to date have also failed to recognize that colonic lavage used to prepare the colon for standard colonoscopy significantly dilutes and distorts the enteric microbiome. Moreover, our knowledge of the human enteric microbiome is heavily based on analyses of stool and luminal samples which may not be sufficiently representative of the more residential and geographically-specific communities of mucosal-associated bacteria. In this regard, novel and underrepresented species that have special conditional or communal properties that facilitate their close proximity to the host are likely to be missed. These organisms are likely to have direct bearing on human health and disease. Regional differences in their composition and community organization must also be factored in when studying the human enteric microbiome. In this proposal, these issues will be taken into consideration in developing new and improved non-cultivation-based technologies that will ultimately facilitate genomic sequencing and metagenomic analysis of substantial numbers of previously uncharacterized members of the human enteric microbiome. We propose to obtain region-specific samples of mucosal-associated microbes in their natural state within the unprepped human colon. We will then develop and refine two non-cultivation-based approaches aimed at obtaining high grade, composite DNA of microbial communities or enriched/purified samples of underrepresented, unclassified microbial species. 
The first involves laser capture microdissection of mucosal-associated bacteria from different regions of the human colon, which will be developed primarily for generating high quality DNA for metagenomic analyses. The second approach involves fluorescence in situ hybridization (FISH) using 16S rDNA and metagenomically-determined unique riboprobes coupled with fluorescence-activated cell sorting (FACS). Yield, enrichment, and purity will be optimized to discover and isolate novel, unclassified, and rare microbes from the human colonic microbiome for whole genome sequencing by high throughput sequencing centers. We believe these studies will produce non-cultivation-based technologies that will advance genomic sequencing and metagenomic analysis of substantial numbers of previously uncharacterized members of the human enteric microbiome.

Project Title Principal Investigator(s) Institution(s)
Algorithmically-Tuned Protein Families, Rule_Base and Characterized Proteins Daniel H Haft J. Craig Venter Institute, Inc.

Analysis of the microbial communities present in or on the human body holds promise for explaining the dynamic basis of host-microbiome symbiosis and the contribution of these communities (the human microbiome) to health and disease. Vast amounts of metagenomic DNA sequence can be collected. However, current bioinformatics tools limit our ability to translate sequence into fundamentally new biomedical knowledge. There is a great need to improve existing tools and develop computational methods to address the complexity of data generated by human microbiome projects (HMP). This proposal takes a three-pronged approach to dramatically improve methods for extracting meaning from HMP sequence data. The first is to develop algorithms that build protein families, each family just inclusive enough that checking a genome for some cohort of families tells whether or not a pathway is present. These algorithms resemble Phylogenetic Profiling, a data mining technique, but go through optimization steps that guide the building of each family. Pre-built families are not required. The result is new descriptive power that can discover and describe new systems and pathways. Thousands of new families will be created. The second is a new way to apply annotation rules. Large numbers of rules created automatically, each of which works on fairly small numbers of proteins, can apply very exacting tests to determine whether one protein should be expected to have the same function as another that is already characterized. By deriving support from comparing gene regions or metabolic backgrounds in ways made possible only by having large numbers of complete genomes, these rules can achieve much greater confidence than more simplistic annotation techniques. The third is a systematic compilation of the right starting points for annotation. 
Annotation methods today are built to achieve maximum leverage from those few proteins whose functions are known for sure, but searching for those good anchors is surprisingly difficult, and searching repeatedly is wasteful. The CHAR database will collect experimentally characterized proteins and make them "rule-ready" and universally available. All of the resources developed through this proposal will be made publicly available. These approaches combine to let us read metabolic properties from microbial genome sequences more accurately, and figure out better ways to fight disease.
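The abstract above compares its family-building algorithms to Phylogenetic Profiling. As a rough illustration of that general technique only (not of the proposed algorithms, which add optimization steps not described here), the sketch below records in which genomes each protein family is present and flags pairs of families with similar presence/absence patterns. The genome names, family names, and the Jaccard threshold are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy input: which protein families were detected in each genome.
TOY_GENOMES = {
    "genome_A": {"famX", "famY", "famZ"},
    "genome_B": {"famX", "famY"},
    "genome_C": {"famZ"},
    "genome_D": {"famX", "famY", "famZ"},
}


def build_profiles(genome_to_families):
    """Return {family: presence/absence tuple}, ordered by sorted genome name."""
    genomes = sorted(genome_to_families)
    families = sorted({f for fams in genome_to_families.values() for f in fams})
    return {
        fam: tuple(int(fam in genome_to_families[g]) for g in genomes)
        for fam in families
    }


def jaccard(a, b):
    """Jaccard similarity of two equal-length 0/1 profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


def co_occurring_pairs(profiles, threshold=0.75):
    """List family pairs whose profiles are at least `threshold` similar."""
    pairs = []
    for f1, f2 in combinations(sorted(profiles), 2):
        score = jaccard(profiles[f1], profiles[f2])
        if score >= threshold:
            pairs.append((f1, f2, score))
    return pairs


if __name__ == "__main__":
    print(co_occurring_pairs(build_profiles(TOY_GENOMES)))
```

In this toy data, famX and famY appear in exactly the same genomes, so they are reported as a candidate pair that may belong to a shared system or pathway; famZ follows a different pattern and is not paired with either.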

New Tools for Understanding the Composition and Dynamics of Microbial Communities Robin Knight University of Colorado at Boulder

The microbes that inhabit human bodies outnumber the human cells by an order of magnitude, and impact many aspects of health and disease including obesity, vaginosis, and Crohn's disease. Understanding this endogenous microbiota is emerging as a key extension of efforts to understand the human genome and the role of genetic variation in health and disease. The Human Microbiome Project (HMP) will characterize microbial communities in a large number of individual healthy humans using metagenomic sequencing. Consequently, new methods for interpreting sequence data to understand microbial community composition and dynamics are urgently needed. This project unites disciplines ranging from ecology to evolutionary biology to applied mathematics, to develop new methods for understanding which body habitats are more or less similar in terms of their microbial communities, by evaluating measures of microbial diversity and change, and creating needed new metrics of community composition. This will enable understanding of how clinically relevant parameters such as age, sex, or the pH of specific body habitats affect these communities, and of the dynamics of change in microbial communities within an individual, in transmission between individuals, and in transmission between humans and the environment. This project is directly responsive to the Roadmap RFA for Development of New Tools for Computational Analysis of Human Microbiome Project Data. The specific aims of this proposal are: Aim 1. Develop, characterize, and apply enriched descriptors of microbial community diversity. Aim 2. Develop methods for describing how human microbial communities vary over time and space. Aim 3. Develop new methods for tracing the flow of organisms among different communities. 
Some key aspects of the proposed work are: the development of new statistical methods for estimating microbial diversity within a body habitat; development of enriched methods for describing microbial community diversity; exhaustive validation of methods for comparing microbial communities through large-scale simulations and by using the largest available data sets that characterize microbial communities empirically; and the development of new methods for tracing the sources of the microbes that inhabit the human body using both marker genes and whole-metagenome data. Key outcomes include the ability to help determine the extent to which there is a core human microbiome, and how best to sample human microbial diversity. All methods developed will be made available under open source licenses and will be deposited with the HMP Data Analysis and Coordination Center (DACC). The investigators intend to work closely with other researchers involved in the HMP in order to ensure rapid progress.
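To make "diversity within a body habitat" and "comparing microbial communities" concrete, the sketch below computes two long-established ecological measures: the Shannon index for a single community and the Bray-Curtis dissimilarity between two communities. These are shown only as familiar baselines and are not the new metrics this project proposes to develop. The taxon names and counts are hypothetical.

```python
import math


def shannon_index(counts):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(
        (c / total) * math.log(c / total) for c in counts.values() if c > 0
    )


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: 0 for identical counts, 1 for no shared taxa."""
    taxa = set(a) | set(b)
    difference = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    return difference / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical read counts per taxon for two body habitats.
    habitat_1 = {"taxon_1": 50, "taxon_2": 30, "taxon_3": 20}
    habitat_2 = {"taxon_1": 10, "taxon_3": 60, "taxon_4": 30}
    print("Shannon, habitat 1:", shannon_index(habitat_1))
    print("Shannon, habitat 2:", shannon_index(habitat_2))
    print("Bray-Curtis:", bray_curtis(habitat_1, habitat_2))
```

Both measures ignore how closely related the taxa are, which is one reason enriched descriptors of community diversity are of interest.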

Assembly and analysis software for exploring the human microbiome Mihai Pop University of Maryland College Park

Bacteria are the most abundant organisms on Earth, yet little is known about most members of this domain of life. Only about 1% of bacterial species can be easily grown in culture, and considerably fewer have been sequenced. Advances in sequencing technologies have made it possible to sequence bacteria directly from the environment, providing a dramatic new outlook on the diversity of bacteria populating our world. Initial studies have explored the bacteria present in mines, ocean water, and soil, as well as communities of commensal microbes that inhabit the human body. The latter have provided a glimpse at the complex symbiotic relationships between bacteria and their human hosts. Despite an increased interest in environmental sequencing (metagenomics), few specialized computational algorithms exist for the analysis of such data. For example, the assembly of environmental data is being performed with software originally intended for homogeneous DNA sources, such as clonal bacterial populations or inbred eukaryotes. These programs are ill-suited to the assembly of heterogeneous microbial communities and numerous "hacks" have been necessary to produce the assemblies published to date. This proposal aims to fill the need for specialized software for assembling and finding genes in metagenomic datasets. A particular focus will be on developing tools for uncovering genomic variation within the assemblies of microbial communities. The proposed software will specifically address issues arising from the use of new sequencing technologies in metagenomic projects. The low cost and high throughput of these technologies will allow a far deeper exploration of the microbial biosphere than was previously possible. Their broad application, however, depends on the availability of software systems adapted to their specific characteristics. 
In addition, new algorithms will be developed to allow the individual components of a metagenomic analysis pipeline to be tightly integrated, with the goal of improving the overall quality of both assembly and annotation, and to facilitate the extraction of other types of information from large sets of metagenomic data. The proposal further aims to investigate the impact of experimental design and choice of sequencing technology on the ability to assemble and analyze metagenomic data, through the development of software for simulating bacterial populations and emulating a variety of sequencing strategies. Better experimental design can reduce the high costs currently associated with environmental sequencing and enhance subsequent analyses. All software developed as part of this proposal, as well as any simulated data and results of reanalyzing public datasets, will be released freely through public databases and open-source software repositories.

Fragment assembly and metabolic/species diversity analysis for Human microbiome Yuzhen Ye Indiana University Bloomington

The human microbiome contributes essential and complementary genetic and metabolic components to the human host. Until recently, microbiologists mainly studied individual culturable species of microbes, even though the vast majority (approximately 95%-98%) of microorganisms cannot live in pure culture. Facilitated by the rapid advancement of DNA sequencing techniques, metagenomics attempts to directly determine the whole collection of genes within an environmental sample. To study the human microbiome at a global level, metagenomics has become the methodology of choice for the Human Microbiome Project (HMP). We propose to develop computational methods addressing several challenges in metagenomic analysis for the HMP, namely, the assembly of short reads from pyrosequencing, the functional annotation of protein coding genes through database searching, and the characterization of the biodiversity in samples. We start with a novel approach to assembling short reads from metagenomics, called ORFome Assembly, which assembles putative ORFs from homologous proteins in the same family into a protein family graph (an Eulerian path approach). We then propose a network matching approach for the similarity search using the protein family graphs as queries. We anticipate that using protein family graphs will result in database searching with higher sensitivity and specificity than simply using unassembled sequencing reads. Finally, we propose to develop computational tools to simultaneously assess the biodiversity and biological functions in samples, by identifying the most likely set of coherent pathway variants covering the annotated gene functions within the metagenomic data based on the similarity search results.
These software tools will enable researchers to efficiently and effectively analyze data from the HMP, which will enhance the understanding of the relationship between the human microbiota (i.e., the microbes living on the surface of and inside the human body) and human diseases, and hasten the development of better or new therapies.
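For readers unfamiliar with the term, the general idea behind an Eulerian path approach to assembly can be sketched in a few lines. The example below is a generic illustration on nucleotide k-mers and is not the ORFome Assembly method itself, which operates on putative ORFs and protein family graphs. The function names, the toy reads, and the choice of k are assumptions made for illustration only.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Each distinct k-mer becomes an edge from its (k-1)-prefix to its (k-1)-suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:  # collapse k-mers observed more than once
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm: walk every edge exactly once."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_deg[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in graph if len(graph[n]) - in_deg[n] == 1),
        next(iter(graph)),
    )
    remaining = {node: list(targets) for node, targets in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell out the sequence along the Eulerian path (toy, error-free reads only)."""
    graph = de_bruijn_graph(reads, k)
    if not graph:
        return ""
    path = eulerian_path(graph)
    return path[0] + "".join(node[-1] for node in path[1:])


# Hypothetical overlapping reads drawn from one short sequence.
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
print(assemble(reads, k=4))
```

This sketch assumes error-free reads and a graph that admits a single Eulerian path; real metagenomic data contain sequencing errors, repeats, and strain variation that make the graph far more tangled.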

High Performance Validation and Classification of Metagenomic Ribosomal-RNA Sequences. Norman Pace University of Colorado

Innovations in culture-independent studies of environmental DNA sequences (i.e., metagenomics), coupled with rapidly advancing DNA sequencing capabilities, have profoundly altered the volume of sequence data that can be processed in a study. However, several bottlenecks to metagenomic data analysis must be overcome as production is scaled up and findings are generalized. These include detection and culling of human and chimeric sequences; removal/correction of sequencing errors; accurate assessment of biodiversity; accurate taxonomic classification of sequences; and analysis of microbial eukaryotes in metagenomic specimens. Our overall objective is to build a framework for evaluating and ensuring the quality of primary sequence data and associated phylogenetic metadata. Because rRNA-based phylogenetic analysis remains an essential means of organizing and interpreting the analyses of other metagenomic sequences, we focus in this proposed project on quality assurance issues related to rRNA sequence data. Specifically, we propose to build a software infrastructure based on a high-precision alignment tool (INFERNAL) that addresses many of the critical barriers to progress facing metagenomic research programs. Rigorous rRNA sequence alignment is a strict requirement for accurate sequence-based phylogenetic classification of microorganisms in metagenomic samples. The open-source INFERNAL alignment software developed by Prof. Sean Eddy (Co-Investigator) and colleagues permits a level of analysis that extends far beyond other widely used automated sequence aligners. This base technology, developed to identify and annotate RNA genes in genomes in conjunction with the Rfam database, offers an opportunity to develop and incorporate features that could significantly reduce current barriers to metagenomic analysis. INFERNAL uses consensus RNA primary and secondary structure (a covariance model; CM) to guide alignment.
Calculation of position-specific measures of alignment uncertainty allows detection of poorly aligned sequences and alignment positions, which can be removed prior to downstream applications, for example phylogenetic inference. INFERNAL-based CM alignment can be used, therefore, as a sensitive mechanism for detecting and eliminating anomalous sequences (e.g., chimeras, non-rRNA sequences) and sequencing errors from datasets. In this two-year project, we propose a leveraged scheme in which the utility of the INFERNAL technology is adapted to the needs of the metagenomics community through joint development by the Pace and Eddy groups. In this proposal the Eddy lab (fully funded by HHMI) will continue to develop the core technology and functionality enhancements of INFERNAL, while the Pace lab (as funded by this grant) will use their extensive background in rRNA phylogenetic analyses to build and validate software tools that extend the basic feature set of INFERNAL, with special emphasis on facilitating research carried out in the Human Microbiome Project.

Exploiting Microbiome Sequences for Improved Models of Protein-DNA Interactions. Gary Stormo Washington University

This project will develop computer programs to exploit the Human Microbiome Project (HMP) DNA sequences to better understand DNA-protein interactions. The interactions between transcription factors and the DNA sites that they bind to are critical to controlling the expression of the genes within each species, and therefore also the characteristics of each species and its interactions with the human host. The transcription factors themselves can be readily identified from DNA sequences and we will take advantage of the fact that most bacterial transcription factors regulate themselves and/or adjacent genes within their chromosomes. Transcription factors can be clustered into groups that are expected to recognize the same patterns of DNA, based on known structures for similar proteins from well studied bacteria. Together the clusters of proteins with very similar specificity and the probable regulatory regions of nearby promoters will give us a very large number of potential DNA-protein interacting sites on which to apply pattern discovery algorithms. This should not only help us to learn about the regulatory networks within the HMP species, but also lead to more general understanding about the relationships between transcription factor proteins and the DNA patterns that they recognize. This will have broader implications across several areas of biological research and may lead to the design of new proteins with novel specificities that could be useful as research tools and for therapeutics.

Novel Computational Tools for Studying the Human Microbiome. David Fredricks Fred Hutchinson Cancer Research Center

The Human Microbiome Project will generate billions of high throughput sequence reads from rRNA gene PCR products and metagenomic DNA; these data have the potential to revolutionize our understanding of the microbial inhabitants of humans, the putative functions of these microbes, and their associations with health and disease. However, limitations in our ability to process this flood of data hinder our ability to make inferences or draw conclusions. Specifically, commonly available methods for identifying microbes from DNA or RNA sequences do not identify organisms to the species level, and may fail to perform confident assignment to the genus level or higher despite sufficient phylogenetic information to do so. As a result, many publicly available classification tools lump sequences representing distinct species into less specific taxonomic categories, as we have found when applying these tools to several novel bacteria linked with vaginal disease. This proposal is significant because it offers solutions to these fundamental problems by developing and refining novel computational tools; prototypes of these tools have already demonstrated significantly improved results. Our freely available software will help catalyze research on the human microbiome by increasing the speed, accuracy, and specificity of microbial identification, as well as offering methods for between-sample comparison. There are several innovative features of this proposal. First, computationally efficient maximum-likelihood phylogenetic placement of sequences on trees will provide a robust method for identifying microbes and distinguishing between novelty and uncertainty. Second, this proposal will provide accurately annotated collections of reference sequences that can facilitate classification of organisms present in major human body sites. 
More importantly, this proposal will develop software tools that will enable individual researchers to assemble sets of reference sequences using an approach that maximizes sequence diversity within each represented taxon while excluding poor quality and mislabeled sequences. Third, this proposal will develop new analysis and visualization tools to aid statistical comparison of microbial communities across space and time, and help capture these complex changes in intuitive visualizations. Aim 1: Develop and optimize phylogenetic placement software for the analysis of 16S rRNA and other phylogenetically informative loci to better describe bacterial diversity and community composition. This aim will advance the development of our phylogenetic placement software pplacer, including the addition of algorithms for taxonomic annotation and species delineation, implementation of improved measures of uncertainty, and low-level code optimization. Aim 2: Develop computational tools to curate project-specific sets of reference sequences from public repositories and local sources. This aim is motivated by our observation that appropriately selected reference sequences and accurate phylogenies are a critical and limiting component of the classification process. Aim 3: Develop a software pipeline to integrate high throughput sequencing data analysis, including preprocessing, phylogenetic placement, statistical comparison, and phylogenetic visualization. This aim will result in two deliverables extending the capabilities of a broad spectrum of researchers: a web service for users who value simplicity, as well as R / Bioconductor software packages for users who value modularity, reproducibility, and extensibility.

Novel Methods for Effective Analysis, Assembly and Comparison of HMP Sequences. Weizhong Li University of California San Diego

The human microbiota is thought to have a profound influence on human health. The goal of the Human Microbiome Project (HMP) is to expand our understanding of the human microbiome by generating reference microbiome genomes, identifying "core" genomes, studying their variation related to human health, and developing new technologies and informatics tools. Huge amounts of sequence data have been generated in the HMP using metagenomics and next-generation sequencing technologies. It is becoming very challenging for existing resources and methods to manage and analyze the HMP data. The challenges are imposed not only by the huge volume but also by the great diversity and complexity of the sequence data. To address these challenges, we propose several new computational methods to rapidly and effectively analyze very large HMP datasets. (1) Consensus-based meta-assembler and pre-assembly processing. The aim is to significantly improve the assembly of metagenomic sequences. Instead of developing another assembly program, we will build a meta-assembler on top of available assemblers. We will also develop a pre-assembly protocol to filter and handle highly redundant and problematic sequences. (2) Fast fragment recruitment and large-scale clustering. We plan to develop a fast program to align raw metagenomic reads to reference or homologous genomes. The aim is to fill the gaps between very fast but very stringent mapping programs (e.g. Bowtie), very slow but very sensitive alignment programs (e.g. BLAST), and fast but less sensitive ones (e.g. BLAT). We also plan to enable our clustering program CD-HIT to handle very large next-generation sequencing datasets. (3) Dedicated utilities for annotation and comparison of metagenomes.
In recent years, we developed an HMM-based method for identification of rRNAs from raw reads, a fast method to identify artificial 454 duplicates, an automated workflow for metagenome annotation, a rapid and reliable reciprocal sequence comparison protocol, and a statistical method to compare many metagenomes with a unique visualization interface. We plan to improve these metagenomics-specific tools to achieve much better speed, performance and capability. The methods will be available as open source software, as web servers, or both. We have obtained very promising preliminary results. The proposed tools will effectively help researchers in HMP data analysis. Other HMP-related informatics tools for gene prediction, binning and assembly will greatly benefit from our proposed work.
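To give a sense of what large-scale sequence clustering involves, the sketch below shows greedy incremental clustering, the general strategy CD-HIT is known for: sequences are processed from longest to shortest, and each one either joins the first cluster whose representative it resembles or founds a new cluster. This is an illustration only and not the CD-HIT implementation. The ungapped identity measure is a deliberately simple stand-in for CD-HIT's short-word filtering and alignment, and the function names, threshold, and toy sequences are assumptions.

```python
def ungapped_identity(short, long):
    """Best fraction of matching positions when sliding `short` along `long`.

    A crude stand-in for a real alignment score (assumption for illustration).
    """
    best = 0.0
    for offset in range(len(long) - len(short) + 1):
        matches = sum(a == b for a, b in zip(short, long[offset:]))
        best = max(best, matches / len(short))
    return best


def greedy_cluster(sequences, threshold):
    """Greedy incremental clustering; the first member of each cluster is its representative."""
    clusters = []
    # Longest sequences first, so representatives are never shorter than members.
    for seq in sorted(sequences, key=len, reverse=True):
        for cluster in clusters:
            if ungapped_identity(seq, cluster[0]) >= threshold:
                cluster.append(seq)
                break
        else:
            clusters.append([seq])  # no representative was similar enough
    return clusters


# Hypothetical toy input.
seqs = ["ACGTACGTAC", "ACGTACGTAA", "TTTTGGGGCC", "ACGTACGT"]
for cluster in greedy_cluster(seqs, threshold=0.85):
    print(cluster[0], "represents", len(cluster), "sequence(s)")
```

Because each sequence is compared only against cluster representatives rather than against every other sequence, the number of comparisons grows with the number of clusters instead of the square of the number of sequences, which is what makes this style of clustering attractive for very large read sets.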

Identifying Population-Level Variation In Cross-Sectional And Longitudinal HMP Studies. Pat Schloss University of Michigan

There is a fundamental gap in understanding how intra- and inter-personal variation in the structure of the human microbiome affects human phenotypes. Continued existence of this gap is problematic because it impedes the ability to relate this variation to changes in host health. Part of the problem is the over-reliance on microbiome-wide metrics of similarity instead of population-based metrics. This is similar to using microarray technology to compare the overall differences in E. coli gene expression in exponential versus stationary phase without addressing the change in expression of individual genes. Yet, a quantitative framework to aid in the analysis of population data from cross-sectional and longitudinal studies is lacking. The long-term goal is to understand the mechanisms that shape the structure and function of the human microbiome. The objective of this proposal is to develop robust computational tools that are optimized to analyze large sequence collections, yet are accessible to the typical investigator who is not an expert in bioinformatics. Specifically, this proposal will fulfill the stated need to develop computational tools that enable HMP scientists to determine whether "variation in the microbiome at a site can be related to human phenotypes, such as disease." The rationale for this proposal is the imminent release of data from a number of HMP Demonstration Projects that are pursuing cross-sectional and longitudinal sampling, but have realized that they are limited in their ability to identify statistically robust linkages between specific changes in the microbiome and human phenotypes.
Building upon extensive previous experience and interactions with HMP investigators, the objective will be achieved by pursuing three specific aims: 1) implement and disseminate computational tools in the mothur software package; 2) develop tools to correlate inter-subject variation in the microbiome with variation in health; and 3) develop tools to connect the dynamics of the microbiome with changes in health. Each of the tools developed in the proposed research will be validated using simulated data and evaluated using HMP-generated sequence data. This research is innovative because it builds upon an already strong collection of tools for describing a community's "parts list" within the popular mothur software package and will create a robust set of statistical tools for assessing temporal variation and how that variation is related to health. The proposed research is significant because it will improve our ability to advance the goals of the HMP by relating community and population-level dynamics to changes in human health.

Functional Activity and Inter-Organismal Interactions in the Human Microbiome. Curtis Huttenhower Harvard University

High-throughput sequencing has provided a tool capable of observing the human microbiome, but characterizing the biological roles and metabolic potential of these microbial communities remains a significant challenge. Increasing evidence points to the functional activity of gene products, rather than community taxonomic composition, as the most robust descriptor of the microflora's relationship with its host and as a potential point of intervention in modulating human health. Existing computational tools for exploring a newly sequenced metagenome rely heavily on sequence homology and do not yet leverage information from the thousands of publicly available functional experimental results. Likewise, no previous methods have provided genome-scale computational tools for biological hypothesis generation regarding specific molecular interactions among the microflora and with a human host. This proposal aims to develop computational methodology to interpret the functional activity of microfloral communities: 1. Integrate functional information from taxonomic, metagenomic, and metatranscriptomic datasets. We will develop methodology to unify these three representations of microbiome composition by incorporating information from large scale functional genomic data collections. 2. Identify genomic predictors of inter-species functional activity, including host/microflora interactions and points of community-wide regulatory feedback. We will computationally screen microbiome assays for molecular interactions and regulatory motifs spanning multiple organisms in the community. 3. Implement these technologies as publicly available, accessible, and interpretable tools. We will provide freely available, open source, downloadable and web-based implementations of this methodology for use by the bioinformatic and biological communities. 
As high-throughput sequencing becomes more widely used to study microbial communities in the human microbiome and in the environment, computational tools will be necessary to summarize their global functional activity and systems-level regulatory interactions. In the long term, by providing methodology to understand the human microbiome at the molecular level, we hope to enable its future use as a diagnostic indicator and as a point of intervention to improve human health.