HMGC - Clustered Gene Indices

We have created a non-redundant catalog of bacterial genes by body site. The catalog was built by clustering the HMP gene indices (gene predictions from the 690 samples that passed QC), using the same identity parameter that the MetaHIT project used to cluster its human intestinal tract data (Qin et al., 2010. Nature, 464:59). Genes were compared at a 95% identity cut-off, and overlapping genes that met this cut-off were considered redundant and removed. The non-redundant gene set contains a total of 15,006,602 genes across all 15 body sites.
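
The redundancy-removal idea can be illustrated with a toy sketch. This is not the HMP pipeline: the aligner, overlap rule, and processing order used for the real catalog are not stated here, so the longest-first greedy pass and the `approx_identity` helper (a crude stand-in for alignment identity built on Python's `difflib`) are assumptions made only for illustration, as are the gene IDs and sequences.

```python
from difflib import SequenceMatcher


def approx_identity(a: str, b: str) -> float:
    """Crude stand-in for alignment identity (assumption, not a real aligner):
    characters in common matching blocks / length of the shorter sequence."""
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def nonredundant(genes: dict, cutoff: float = 0.95) -> list:
    """Greedy pass: visit genes longest first and keep a gene only if it
    stays below the identity cut-off against every gene already kept."""
    kept = []
    for gene_id in sorted(genes, key=lambda g: (-len(genes[g]), g)):
        seq = genes[gene_id]
        if all(approx_identity(seq, genes[k]) < cutoff for k in kept):
            kept.append(gene_id)
    return kept


# Made-up sequences: g2 is a fragment of g1, g3 is unrelated.
g1 = "ATGGCTAGCTAGGATCCGATCGTTAGCAAGCTTGGCTAA"
genes = {"g1": g1, "g2": g1[5:30], "g3": "ATG" + "C" * 24 + "TAA"}
print(nonredundant(genes))  # ['g1', 'g3']
```

The pairwise comparison is quadratic in the number of genes, so it only suits small examples; a catalog of millions of genes needs an indexed aligner or a dedicated clustering tool.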

Attention: the gene counts available here differ from those originally reported in the publication, because samples that did not pass our assembly QC had not been removed before the initial clustering. Clustering has since been rerun, and the data are now accurate. We apologize for any inconvenience this may have caused our users.

Gene clusters were annotated for carbohydrate-active enzymes using the CAZy database. This annotation is available here as a GFF3 file.
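
As a rough sketch of how such a file might be summarized, the snippet below reads standard GFF3 records (nine tab-separated columns, with `key=value` pairs in the ninth) and tallies the values of one attribute. Which attribute carries the CAZy family in the HMP file is not stated here, so the `Name` key, the cluster IDs, and the sample records are hypothetical; check the actual file before relying on them.

```python
from collections import Counter
from urllib.parse import unquote


def parse_attributes(column9: str) -> dict:
    """Split a GFF3 attributes column (key=value;key=value) into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[unquote(key)] = unquote(value)  # GFF3 percent-encodes reserved characters
    return attrs


def count_attribute(lines, key: str) -> Counter:
    """Count how often each value of `key` occurs across feature lines."""
    counts = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):  # optional embedded sequence section: stop here
            break
        if not line or line.startswith("#"):  # skip blanks, comments, directives
            continue
        cols = line.split("\t")
        if len(cols) != 9:  # not a well-formed feature line
            continue
        value = parse_attributes(cols[8]).get(key)
        if value is not None:
            counts[value] += 1
    return counts


# Made-up records; "Name" as the family attribute is an assumption.
sample = [
    "##gff-version 3",
    "cluster_1\tCAZy\tCDS\t1\t900\t.\t+\t0\tID=c1;Name=GH13",
    "cluster_2\tCAZy\tCDS\t1\t600\t.\t-\t0\tID=c2;Name=GT2",
    "cluster_3\tCAZy\tCDS\t1\t750\t.\t+\t0\tID=c3;Name=GH13",
]
print(count_attribute(sample, "Name").most_common())  # [('GH13', 2), ('GT2', 1)]
```

For a real file, pass an open file handle instead of the list, since `count_attribute` only needs an iterable of lines.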
