Fixing high sparsity in microbiome sequencing data | Genome Biology

Ruochen Jiang; Wei Vivian Li; Jingyi Jessica Li

mbImpute performs imputation on microbiome sequencing data and facilitates scientific discoveries.

Microorganisms such as bacteria, archaea, fungi and viruses play important roles in both ecosystems and human bodies. The development of high-throughput sequencing technologies has advanced microbiome studies in the last decade. Two sequencing technologies are primarily used: 16S ribosomal RNA (rRNA) amplicon sequencing and shotgun metagenomic sequencing. However, the microbiome data generated by both technologies are highly sparse. The prevalence of zero abundances of microbes in samples hampers downstream analyses because many state-of-the-art analytical methods perform poorly on highly sparse data.

We propose mbImpute, the first imputation method designed to alleviate the high sparsity of microbiome sequencing data. Our goal is to correct the likely non-biological zeros in microbiome data and thus to facilitate downstream analyses. One feature of microbiome sequencing data is the availability of side information, including samples’ metadata and microbes’ phylogenetic distances. Unlike existing imputation methods, which were designed for matrix completion and single-cell RNA-sequencing data, mbImpute can leverage this side information when it is available. An important application of mbImpute is to increase the power of identifying differentially abundant microbes from microbiome data. In particular, we show that mbImpute improves the power of DESeq2+phyloseq, a popular method for identifying such microbes.



Microbiome studies have gained increased attention since many discoveries revealed connections between human microbiome compositions and diseases. A critical challenge in microbiome data analysis is the existence of many non-biological zeros, which distort taxon abundance distributions, complicate data analysis, and jeopardize the reliability of scientific discoveries. To address this issue, we propose the first imputation method for microbiome data, mbImpute, which identifies and recovers likely non-biological zeros by borrowing information jointly from similar samples, similar taxa, and optional metadata including sample covariates and taxon phylogeny. Comprehensive simulations verify that mbImpute achieves better imputation accuracy under multiple metrics than five state-of-the-art imputation methods designed for non-microbiome data. In real data applications, we demonstrate that mbImpute improves the power of identifying disease-related taxa from microbiome data of type 2 diabetes and colorectal cancer, and that mbImpute preserves the non-zero distributions of taxon abundances.
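
To make the idea of "borrowing information from similar samples" concrete, here is a minimal Python sketch. It is not the mbImpute algorithm and does not use its software; it is a simplified nearest-neighbor illustration under our own assumptions. The function name `toy_neighbor_impute`, the parameters `k` and `min_nonzero_frac`, and the example counts are all hypothetical. A zero is treated as likely non-biological only when enough of the most similar samples have a non-zero value for the same taxon, and all observed non-zero values are left unchanged.

```python
import numpy as np


def toy_neighbor_impute(counts, k=2, min_nonzero_frac=0.5):
    """Toy illustration only; NOT the mbImpute algorithm.

    counts: array of shape (n_samples, n_taxa) with non-negative counts.
    k: number of most similar samples to borrow from (hypothetical parameter).
    min_nonzero_frac: fraction of neighbors that must be non-zero for a
        zero to be treated as likely non-biological (hypothetical parameter).
    """
    counts = np.asarray(counts, dtype=float)
    log_x = np.log1p(counts)  # work on the log(1 + count) scale
    imputed = log_x.copy()

    # Pairwise Euclidean distances between samples on the log scale.
    diff = log_x[:, None, :] - log_x[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is never its own neighbor

    for i in range(log_x.shape[0]):
        neighbors = np.argsort(dist[i])[:k]
        for j in np.where(counts[i] == 0)[0]:
            neighbor_vals = log_x[neighbors, j]
            nonzero = neighbor_vals > 0
            # Impute only if enough similar samples contain this taxon.
            if nonzero.any() and nonzero.mean() >= min_nonzero_frac:
                imputed[i, j] = neighbor_vals[nonzero].mean()

    return np.expm1(imputed)  # back to the count scale


if __name__ == "__main__":
    # Made-up counts: 4 samples (rows) by 4 taxa (columns).
    example = np.array([
        [10, 0, 5, 0],
        [12, 3, 6, 0],
        [11, 4, 4, 0],
        [9, 2, 5, 1],
    ])
    print(np.round(toy_neighbor_impute(example, k=2), 2))
```

In this sketch, a taxon that is zero in every sample is never imputed, which mirrors the goal of leaving likely biological zeros alone. Unlike the method described above, the sketch ignores similar taxa, sample covariates, and taxon phylogeny.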


Media Contact:

Leticia Ortiz | Marketing & Communications | Building a community around data science in biomedicine