Platform for Representation and Integration of multimodal Molecular Embeddings

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing gene embedding methods are typically confined to single tasks or modalities, which limits their ability to capture the diversity of gene functions and interactions. To address this, we propose PRISME, a unified multimodal gene representation framework that integrates three heterogeneous data sources: omics experimental measurements, biomedical literature text, and structured knowledge graphs. PRISME uses an adjusted variant of Singular Vector Canonical Correlation Analysis (SVCCA) to quantify inter-modal redundancy and complementarity; the finding that existing embeddings capture largely non-overlapping signals motivates autoencoder-based cross-modal fusion. Evaluated on multiple benchmark tasks, including missing value imputation and gene function prediction, PRISME performs consistently and outperforms individual embedding methods on missing value imputation. The framework is intended to support robust, broadly applicable multimodal embeddings for downstream biomedical machine learning.

📝 Abstract
Existing machine learning methods for molecular (e.g., gene) embeddings are restricted to specific tasks or data modalities, which limits their effectiveness to narrow domains. As a result, they fail to capture the full breadth of gene functions and interactions across diverse biological contexts. In this study, we systematically evaluated knowledge representations of biomolecules in a task-agnostic manner across multiple dimensions, spanning three major data sources: omics experimental data, literature-derived text data, and knowledge graph-based representations. To distinguish meaningful biological signals from chance correlations, we devised an adjusted variant of Singular Vector Canonical Correlation Analysis (SVCCA) that quantifies signal redundancy and complementarity across data modalities and sources. These analyses reveal that existing embeddings capture largely non-overlapping molecular signals, highlighting the value of embedding integration. Building on this insight, we propose the Platform for Representation and Integration of multimodal Molecular Embeddings (PRISME), a machine learning-based workflow that uses an autoencoder to integrate these heterogeneous embeddings into a unified multimodal representation. We validated this approach across various benchmark tasks, where PRISME demonstrated consistent performance and outperformed individual embedding methods in missing value imputation. This framework supports comprehensive modeling of biomolecules, advancing the development of robust, broadly applicable multimodal embeddings optimized for downstream biomedical machine learning applications.
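To make the redundancy analysis concrete, here is a minimal NumPy sketch of SVCCA between two embedding matrices whose rows are the same genes. The abstract does not specify how the authors' adjustment works, so the chance correction below (subtracting a row-permutation baseline and rescaling) is only an assumed stand-in, and the names `svcca`, `adjusted_svcca`, `var_kept`, and `n_perm` are hypothetical.

```python
import numpy as np

def _top_subspace(X, var_kept=0.99):
    """Orthonormal basis for the top singular directions of column-centered X."""
    Xc = X - X.mean(axis=0, keepdims=True)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(energy, var_kept)) + 1  # smallest k reaching var_kept
    return U[:, :k]

def svcca(X, Y, var_kept=0.99):
    """Mean canonical correlation between the SVD-reduced subspaces of X and Y."""
    Ux = _top_subspace(X, var_kept)
    Uy = _top_subspace(Y, var_kept)
    # With orthonormal bases, canonical correlations are the singular values
    # of Ux^T Uy (cosines of the principal angles between the subspaces).
    rho = np.linalg.svd(Ux.T @ Uy, compute_uv=False)
    return float(np.clip(rho, 0.0, 1.0).mean())

def adjusted_svcca(X, Y, n_perm=20, seed=0):
    """ASSUMED chance correction, not the paper's: compare the observed score
    with the average score after shuffling the gene order of Y."""
    rng = np.random.default_rng(seed)
    observed = svcca(X, Y)
    null = float(np.mean([svcca(X, Y[rng.permutation(len(Y))])
                          for _ in range(n_perm)]))
    if null >= 1.0:
        return 0.0
    return (observed - null) / (1.0 - null)
```

Under this assumed correction, a score near 0 would indicate no more shared signal than expected by chance (complementary embeddings), and a score near 1 would indicate largely redundant embeddings.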
Problem

Research questions and friction points this paper is trying to address.

Existing molecular embeddings are limited to specific tasks or data modalities
Current methods fail to capture full gene functions across diverse biological contexts
Lack of a unified framework for integrating multimodal molecular embeddings
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adjusted SVCCA for signal redundancy analysis
Autoencoder integrates heterogeneous molecular embeddings
PRISME unifies multimodal biomolecular representations
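The autoencoder-based integration listed above can be illustrated with a small NumPy sketch. It is not the PRISME implementation: the single tanh hidden layer, the plain gradient-descent training loop, the hyperparameters, and the names `concat_modalities` and `train_autoencoder` are all assumptions made for illustration. Each modality is an array with one row per gene, in the same gene order.

```python
import numpy as np

def concat_modalities(embeddings):
    """Z-score each modality's dimensions, then concatenate along features.
    `embeddings` maps a modality name to an (n_genes, dim) array."""
    parts = []
    for E in embeddings.values():
        parts.append((E - E.mean(axis=0)) / (E.std(axis=0) + 1e-8))
    return np.hstack(parts)

def train_autoencoder(X, latent_dim=8, epochs=500, lr=0.05, seed=0):
    """Fit a one-hidden-layer autoencoder with mean squared reconstruction loss.
    Returns an encoder function and the per-epoch loss history."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W_enc = rng.normal(0.0, 0.1, (d, latent_dim))
    b_enc = np.zeros(latent_dim)
    W_dec = rng.normal(0.0, 0.1, (latent_dim, d))
    b_dec = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W_enc + b_enc)      # unified latent representation
        err = H @ W_dec + b_dec - X         # reconstruction error
        losses.append(float(np.mean(err ** 2)))
        g_out = 2.0 * err / (n * d)         # gradient of the loss w.r.t. output
        g_pre = (g_out @ W_dec.T) * (1.0 - H ** 2)  # backprop through tanh
        W_dec -= lr * (H.T @ g_out)
        b_dec -= lr * g_out.sum(axis=0)
        W_enc -= lr * (X.T @ g_pre)
        b_enc -= lr * g_pre.sum(axis=0)

    def encode(Z):
        return np.tanh(Z @ W_enc + b_enc)

    return encode, losses
```

Calling `encode` on the concatenated matrix yields one fused vector per gene. Standardizing each modality first is a design choice in this sketch so that no single source dominates the reconstruction loss.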
Erika Yilin Zheng
Department of Physiology, University of California, Los Angeles
Yu Yan
Medical Informatics HA, Department of Physiology, University of California, Los Angeles
Baradwaj Simha Sankar
Department of Physiology, University of California, Los Angeles
Ethan Ji
Department of Physiology, Department of Computer Science, University of California, Los Angeles
Steven Swee
Medical Informatics HA, Department of Physiology, University of California, Los Angeles
Irsyad Adam
Medical Informatics PhD, UCLA
Knowledge Graphs, GNNs, Multi-Omics Integration, Multi-Modal Fusion Models, Model Explainability
Ding Wang
Department of Physiology, University of California, Los Angeles
Alexander Russell Pelletier
Department of Physiology, Department of Computer Science, University of California, Los Angeles
Alex Bui
Medical Informatics HA, Department of Computer Science, University of California, Los Angeles
Wei Wang
Medical Informatics HA, Department of Computer Science, University of California, Los Angeles
Peipei Ping
Professor of Physiology, UCLA
cardiovascular medicine, proteomics, data science