MIMIC: A Generative Multimodal Foundation Model for Biomolecules

📅 2026-04-27

🤖 AI Summary
The function of biological macromolecules is jointly governed by sequence, structure, regulation, evolution, and cellular context, yet existing foundation models are often confined to a single modality or a fixed task. To address this limitation, we propose MIMIC, a generative multimodal foundation model trained on the newly curated, aligned dataset LORE. MIMIC employs a split-track encoder-decoder architecture that accepts any subset of observed modalities as conditional input, enabling unified generative modeling across nucleic acids, proteins, structures, evolutionary profiles, regulatory signals, and semantic context. The model supports conditional prediction, allele- and isoform-aware inference, and constrained sequence design, and it incorporates experimental context as semantic conditioning. MIMIC achieves state-of-the-art performance on downstream tasks such as RNA splicing prediction, identifies corrective edits for a splice-disrupting HBB mutation without reverting it, and generates diverse, high-confidence protein sequences with strong in silico support for binding PD-L1 and hACE2.

📝 Abstract
Biological function emerges from coupled constraints across sequence, structure, regulation, evolution, and cellular context, yet most foundation models in biology are trained within one modality or for a fixed forward task. We present MIMIC, a generative multimodal foundation model trained on our newly curated and aligned dataset, LORE, linking nucleic acid, protein, evolutionary, structural, regulatory, and semantic/contextual modalities within partially observed biomolecular states. MIMIC uses a split-track encoder-decoder architecture to condition on arbitrary subsets of observed modalities and reconstruct or generate missing components of molecular state across the genome, transcriptome, and proteome. Multimodal conditioning consistently improves MIMIC's sequence reconstruction relative to sequence-only inputs, while its learned representations enable state-of-the-art performance on RNA and protein downstream tasks. MIMIC achieves state-of-the-art splicing prediction, and its joint generative formulation enables isoform-aware inference that further improves performance. Beyond prediction, the same generative framework supports constrained design. For RNA, MIMIC identifies corrective edits in a clinically relevant HBB splice-disrupting mutation without reverting it by using evolutionary and structural signals. For proteins, jointly conditioning on shape and surface chemistry of PD-L1 and hACE2 binding sites produces diverse, high-confidence sequences with strong in silico support for target binding. Finally, MIMIC uses experimental context as semantic conditioning to model assay-dependent RNA chemical probing, rather than treating context as a fixed output. Together, these results position MIMIC's aligned multimodal generative modeling as a strong foundation for unifying representation learning, conditional prediction, and constrained biomolecular design within a single model.
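
The abstract describes conditioning on arbitrary subsets of observed modalities and reconstructing the missing ones. As a rough illustration of that idea only, the sketch below randomly partitions the modalities available for one partially observed record into conditioning inputs and reconstruction targets. The modality names, the `split_observed_and_targets` helper, and the sampling scheme are all hypothetical; none of this is taken from the MIMIC or LORE code, and the paper's actual training procedure may differ.

```python
import random

# Hypothetical modality names, loosely following the abstract's list.
MODALITIES = [
    "nucleic_acid", "protein", "evolution",
    "structure", "regulation", "context",
]


def split_observed_and_targets(state, rng, min_observed=1):
    """Partition the available modalities of one record into
    conditioning inputs and reconstruction targets.

    `state` maps modality name -> data, with None (or a missing key)
    for modalities that were not measured for this record.
    """
    available = [m for m in MODALITIES if state.get(m) is not None]
    if len(available) < 2:
        raise ValueError("need at least two available modalities")
    # Keep at least one modality as input and at least one as target.
    k = rng.randint(min_observed, len(available) - 1)
    observed = set(rng.sample(available, k))
    inputs = {m: state[m] for m in available if m in observed}
    targets = {m: state[m] for m in available if m not in observed}
    return inputs, targets


# Toy record: only three of the six modalities are present.
example_state = {
    "nucleic_acid": "AUGGCC",
    "protein": "MA",
    "structure": None,
    "context": "assay description (placeholder)",
}
inputs, targets = split_observed_and_targets(example_state, random.Random(0))
```

Under these assumptions, a model trained on many such random splits would see every combination of observed and missing modalities, which is one plausible way to support conditioning on any subset at inference time.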
Problem

Research questions and friction points this paper is trying to address.

multimodal
biomolecules
foundation model
generative modeling
molecular representation
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal foundation model
generative modeling
biomolecular design
conditional reconstruction
aligned biological modalities
Authors
Siavash Golkar
Research Scientist, New York University
Machine Learning, Artificial Intelligence, Theoretical Physics
Jake Kovalic
Polymathic AI; Department of Applied Physics, Yale University
Irina Espejo Morales
Polymathic AI; Center for Data Science, New York University
Samuel Sledzieski
Research Fellow at Flatiron Institute CCB
bioinformatics, computational biology, protein interaction, machine learning, biological networks
Minhuan Li
Center for Computational Mathematics, Flatiron Institute; Center for Computational Biology, Flatiron Institute
Ksenia Sokolova
Center for Computational Biology, Flatiron Institute
Geraud Krawezik
Polymathic AI; Scientific Computing Core, Flatiron Institute
Alberto Bietti
Flatiron Institute, Simons Foundation
machine learning, optimization, statistics
Claudia Skok Gibbs
Center for Data Science, New York University
Roman Klypa
Université Grenoble Alpes, CNRS, Grenoble INP, LJK
Shengwei Xiong
Department of Chemistry, New York University
Francois Lanusse
Polymathic AI; AIM, Université Paris-Saclay, Université Paris Cité
Liam Parker
UC Berkeley / Polymathic AI
Cosmology, Astrophysics, Machine Learning
Kyunghyun Cho
New York University, Genentech
Machine Learning, Deep Learning
Miles Cranmer
University of Cambridge
Machine Learning, Astrophysics, Fluid Dynamics
Tom Hehir
Polymathic AI; Institute of Astronomy, University of Cambridge
Michael McCabe
Flatiron Institute
Machine learning, computational science, optimization, numerical analysis
Lucas Meyer
The Forecasting Company
deep learning, high performance computing, physics simulation, time series
Rudy Morel
Polymathic AI; Center for Computational Mathematics, Flatiron Institute
Payel Mukhopadhyay
Polymathic AI; Department of Applied Mathematics and Theoretical Physics, University of Cambridge
Mariel Pettee
University of Wisconsin-Madison
Machine Learning, High-Energy Particle Physics, Astrophysics
Helen Qu
Polymathic AI; Center for Computational Astrophysics, Flatiron Institute; Department of Physics, New York University
Jeff Shen
Polymathic AI; Department of Astrophysical Sciences, Princeton University
David Fouhey
New York University
Computer Vision, Machine Learning, AI for Science, Solar Physics
Hadi Sotoudeh
Institute of Astronomy, University of Cambridge; Kavli Institute for Cosmology, University of Cambridge