Abundance-Aware Set Transformer for Microbiome Sample Embedding

📅 2025-08-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of effectively incorporating taxonomic abundance information into sample-level embeddings for microbiome data. We propose an abundance-aware Set Transformer aggregation method that requires no architectural modifications: instead, each taxon’s embedding vector is replicated proportionally to its relative abundance, enabling the self-attention mechanism to inherently perform abundance-weighted aggregation. Crucially, this approach integrates abundance information into the Transformer framework in a differentiable, parameter-free manner—marking the first such formulation—and yields fixed-dimensional, biologically interpretable sample embeddings. Evaluated on multiple real-world microbiome tasks—including host phenotype prediction and environmental classification—our method consistently outperforms mean pooling and standard Set Transformer baselines; in several cases, it achieves 100% accuracy. These results demonstrate that abundance-aware aggregation fundamentally enhances microbiome representation learning.

Technology Category

Application Category

📝 Abstract
Microbiome sample representation to input into LLMs is essential for downstream tasks such as phenotype prediction and environmental classification. While prior studies have explored embedding-based representations of each microbiome sample, most rely on simple averaging over sequence embeddings, often overlooking the biological importance of taxa abundance. In this work, we propose an abundance-aware variant of the Set Transformer to construct fixed-size sample-level embeddings by weighting sequence embeddings according to their relative abundance. Without modifying the model architecture, we replicate embedding vectors proportional to their abundance and apply self-attention-based aggregation. Our method outperforms average pooling and unweighted Set Transformers on real-world microbiome classification tasks, achieving perfect performance in some cases. These results demonstrate the utility of abundance-aware aggregation for robust and biologically informed microbiome representation. To the best of our knowledge, this is one of the first approaches to integrate sequence-level abundance into Transformer-based sample embeddings.
Problem

Research questions and friction points this paper is trying to address.

Improving microbiome sample representation for LLM input
Incorporating taxa abundance in embedding construction
Enhancing classification via abundance-aware Transformer embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Abundance-aware Set Transformer for embeddings
Weight sequence embeddings by abundance
Self-attention aggregates replicated embeddings
H
Hyunwoo Yoo
Drexel University, Philadelphia, Pennsylvania, USA
Gail Rosen
Gail Rosen
Professor of ECE, Drexel University
BioinformaticsMetagenomicsGenomic Signal Processing