🤖 AI Summary
This work addresses the challenge of incorporating taxonomic abundance information into sample-level embeddings for microbiome data. We propose an abundance-aware Set Transformer aggregation method that requires no architectural modifications: each taxon’s embedding vector is replicated in proportion to its relative abundance, so that self-attention-based aggregation is weighted by abundance. The approach adds no new parameters and yields fixed-size, biologically informed sample embeddings; to the best of our knowledge, it is one of the first to integrate sequence-level abundance into Transformer-based sample embeddings. Evaluated on real-world microbiome classification tasks, including host phenotype prediction and environmental classification, our method outperforms average pooling and unweighted Set Transformer baselines, achieving perfect performance in some cases. These results support the utility of abundance-aware aggregation for microbiome representation learning.
📝 Abstract
Representing microbiome samples in a form that can be input into LLMs is essential for downstream tasks such as phenotype prediction and environmental classification. While prior studies have explored embedding-based representations of microbiome samples, most rely on simple averaging over sequence embeddings and overlook the biological importance of taxon abundance. In this work, we propose an abundance-aware variant of the Set Transformer that constructs fixed-size sample-level embeddings by weighting sequence embeddings according to their relative abundance. Without modifying the model architecture, we replicate each embedding vector in proportion to its abundance and apply self-attention-based aggregation. Our method outperforms average pooling and unweighted Set Transformers on real-world microbiome classification tasks, achieving perfect performance in some cases. These results demonstrate the utility of abundance-aware aggregation for robust and biologically informed microbiome representation. To the best of our knowledge, this is one of the first approaches to integrate sequence-level abundance into Transformer-based sample embeddings.
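The replication idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `replicate_by_abundance` and `attention_pool`, the `total_copies` budget, the rounding rule, and the use of a single query vector in place of a full Set Transformer are all assumptions made for illustration. The sketch shows why replication acts as a weighting: each copy of an embedding contributes its own softmax term, so a taxon with `c` copies receives `c` times the attention mass of a single copy with the same score.

```python
import math

def replicate_by_abundance(embeddings, abundances, total_copies=100):
    """Repeat each embedding roughly in proportion to its relative abundance.

    `total_copies` and the rounding rule are illustrative assumptions; the
    source only states that vectors are replicated proportionally.
    """
    total = sum(abundances)
    replicated = []
    for emb, abundance in zip(embeddings, abundances):
        if abundance <= 0:
            continue  # absent taxa contribute nothing
        # Keep at least one copy of every taxon that is present (assumption).
        copies = max(1, round(total_copies * abundance / total))
        replicated.extend([emb] * copies)
    return replicated

def attention_pool(vectors, query):
    """Single-query scaled dot-product attention pooling over a set of vectors.

    A simplified stand-in for the attention-based aggregation of a
    Set Transformer, with no learned projections.
    """
    dim = len(query)
    scores = [
        sum(q * v for q, v in zip(query, vec)) / math.sqrt(dim)
        for vec in vectors
    ]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    norm = sum(weights)
    return [
        sum(w * vec[i] for w, vec in zip(weights, vectors)) / norm
        for i in range(dim)
    ]

# Two taxa with abundances 3:1. With a zero query, attention is uniform over
# the replicated set, so pooling reduces to the abundance-weighted mean.
taxa = [[1.0, 0.0], [0.0, 1.0]]
replicated = replicate_by_abundance(taxa, [3, 1], total_copies=4)
pooled = attention_pool(replicated, [0.0, 0.0])
print(len(replicated), pooled)
```

With a non-zero query, the pooled vector weights each taxon by its copy count multiplied by the exponential of its attention score, which is how replication injects abundance without changing the attention mechanism itself.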