DepMicroDiff: Diffusion-Based Dependency-Aware Multimodal Imputation for Microbiome Data

📅 2025-07-31
🤖 AI Summary
Microbiome data are highly sparse and noisy, which limits imputation accuracy and downstream biomarker discovery. Existing methods, including diffusion models, struggle to capture complex inter-taxa dependencies and neglect the guidance available from auxiliary metadata. To address these limitations, the paper proposes DepMicroDiff, a dependency-aware, multimodal diffusion imputation framework with three components: (1) a Dependency-Aware Transformer (DAT) that explicitly models mutual pairwise dependencies and autoregressive relationships between microbial taxa; (2) clinical and phenotypic patient metadata, encoded via a large language model (LLM), used as conditional guidance; and (3) VAE-based pretraining across diverse cancer datasets combined with end-to-end diffusion generation. On TCGA microbiome datasets, the method outperforms state-of-the-art baselines, reaching a Pearson correlation of up to 0.712 and a cosine similarity of up to 0.812, with lower RMSE and MAE across multiple cancer types.

📝 Abstract
Microbiome data analysis is essential for understanding host health and disease, yet its inherent sparsity and noise pose major challenges for accurate imputation, hindering downstream tasks such as biomarker discovery. Existing imputation methods, including recent diffusion-based models, often fail to capture the complex interdependencies between microbial taxa and overlook contextual metadata that can inform imputation. We introduce DepMicroDiff, a novel framework that combines diffusion-based generative modeling with a Dependency-Aware Transformer (DAT) to explicitly capture both mutual pairwise dependencies and autoregressive relationships. DepMicroDiff is further enhanced by VAE-based pretraining across diverse cancer datasets and conditioning on patient metadata encoded via a large language model (LLM). Experiments on TCGA microbiome datasets show that DepMicroDiff substantially outperforms state-of-the-art baselines, achieving higher Pearson correlation (up to 0.712), cosine similarity (up to 0.812), and lower RMSE and MAE across multiple cancer types, demonstrating its robustness and generalizability for microbiome imputation.
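The four metrics quoted above (Pearson correlation, cosine similarity, RMSE, MAE) can be illustrated with a minimal sketch. It assumes imputation is scored only on held-out entries selected by a boolean mask; the function name `imputation_metrics`, the masking rate, and the synthetic data are hypothetical and are not taken from the paper's evaluation code.

```python
import numpy as np

def imputation_metrics(truth, imputed, mask):
    """Score imputed values only at held-out positions (mask == True).

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    t = np.asarray(truth, dtype=float)[mask]
    p = np.asarray(imputed, dtype=float)[mask]
    err = p - t
    rmse = float(np.sqrt(np.mean(err ** 2)))       # root mean squared error
    mae = float(np.mean(np.abs(err)))              # mean absolute error
    pearson = float(np.corrcoef(t, p)[0, 1])       # linear correlation
    cosine = float(t @ p / (np.linalg.norm(t) * np.linalg.norm(p)))
    return {"pearson": pearson, "cosine": cosine, "rmse": rmse, "mae": mae}

# Synthetic stand-in for a samples-by-taxa abundance matrix (assumption).
rng = np.random.default_rng(0)
truth = rng.gamma(shape=2.0, scale=1.0, size=(20, 30))
mask = rng.random(truth.shape) < 0.3               # entries treated as missing
imputed = truth + rng.normal(scale=0.1, size=truth.shape)

scores = imputation_metrics(truth, imputed, mask)
print(scores)
```

Higher Pearson and cosine values and lower RMSE and MAE indicate imputed values closer to the held-out ground truth, which is the direction of improvement the abstract reports.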
Problem

Research questions and friction points this paper is trying to address.

Addressing sparsity and noise in microbiome data imputation
Capturing microbial interdependencies and contextual metadata
Improving accuracy and robustness for downstream biomarker discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Diffusion-based generative modeling with Dependency-Aware Transformer
VAE-based pretraining across diverse cancer datasets
Patient metadata conditioning via large language model
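As background for the diffusion component listed above, the following is a minimal sketch of the standard DDPM forward (noising) step, applied only to entries marked as missing. The linear noise schedule, the function names `forward_noise` and `noise_missing_only`, and the choice to leave observed entries untouched are assumptions made for illustration; the summary does not specify DepMicroDiff's schedule or masking scheme, and the Dependency-Aware Transformer and LLM conditioning are not modeled here.

```python
import numpy as np

def forward_noise(x0, t, betas, rng):
    """Standard DDPM forward step: sample x_t given x_0 at timestep t."""
    alpha_bar = np.cumprod(1.0 - betas)            # cumulative signal retention
    eps = rng.standard_normal(x0.shape)            # Gaussian noise
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def noise_missing_only(x0, observed_mask, t, betas, rng):
    """Noise only the entries to be imputed; keep observed entries fixed.

    Assumption for illustration, not the paper's confirmed procedure.
    """
    xt, eps = forward_noise(x0, t, betas, rng)
    return np.where(observed_mask, x0, xt), eps

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 100)               # hypothetical linear schedule
x0 = rng.random((4, 5))                            # toy samples-by-taxa matrix
observed = np.zeros(x0.shape, dtype=bool)
observed[:, :2] = True                             # first two taxa are observed

xt, eps = noise_missing_only(x0, observed, 50, betas, rng)
print(xt.round(3))
```

A denoising network would then be trained to predict `eps` from `xt`; in a dependency-aware, metadata-conditioned setup that network would also receive the conditioning signal, which this sketch omits.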
Rabeya Tus Sadia
PhD student at University of Kentucky
Spatialomics, VLM, Machine Learning, Bioinformatics, Image Processing
Qiang Cheng
Department of Computer Science, Institute for Biomedical Informatics, University of Kentucky