Universally Converging Representations of Matter Across Scientific Foundation Models

📅 2025-12-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether scientific foundation models trained on molecules, materials, and proteins learn a unified, physics-informed internal representation, and how such representations affect out-of-distribution generalization. Using cross-modal representational analysis, we systematically evaluate representation consistency across nearly 60 state-of-the-art models, including machine-learned interatomic potentials, graph neural networks, and sequence-based models, on diverse chemical systems. We find, for the first time, that high-performing models converge toward similar physical representations within their training domains, and that this representational alignment correlates strongly with generalization performance. On structures far from the training distribution, however, representations degrade rapidly, revealing limits imposed by training data and inductive bias. Based on these findings, we propose "representation alignment" as a quantifiable, domain-agnostic metric for assessing the universality of scientific foundation models. This work establishes a new paradigm for developing robust, interpretable, and physically grounded artificial intelligence for matter modeling, supported by empirical evidence across multiple model families and chemical domains.
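The listing does not specify how representational alignment is computed. A common choice in representational-similarity work is linear centered kernel alignment (CKA); the sketch below assumes each model exposes a per-structure embedding matrix on a shared probe set. The function name and shapes are illustrative, not the paper's stated protocol.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear centered kernel alignment between two embedding matrices.

    X, Y: (n_structures, d) activations from two models on the same
    probe structures; the feature dimensions d may differ per model.
    Returns a similarity score in [0, 1].
    """
    # Center each feature so the score is invariant to mean shifts.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-style cross-covariance norm vs. self-covariance norms.
    cross = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    self_x = np.linalg.norm(X.T @ X, ord="fro")
    self_y = np.linalg.norm(Y.T @ Y, ord="fro")
    return float(cross / (self_x * self_y))
```

Running this over all pairs of models on a shared probe set yields the kind of pairwise alignment matrix such a cross-model study needs; other similarity measures (e.g., representational similarity analysis or mutual nearest neighbors) could be substituted.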

📝 Abstract
Machine learning models of vastly different modalities and architectures are being trained to predict the behavior of molecules, materials, and proteins. However, it remains unclear whether they learn similar internal representations of matter. Understanding their latent structure is essential for building scientific foundation models that generalize reliably beyond their training domains. Although representational convergence has been observed in language and vision, its counterpart in the sciences has not been systematically explored. Here, we show that representations learned by nearly sixty scientific models, spanning string-based, graph-based, 3D atomistic, and protein-based modalities, are highly aligned across a wide range of chemical systems. Models trained on different datasets have highly similar representations of small molecules, and machine learning interatomic potentials converge in representation space as they improve in performance, suggesting that foundation models learn a common underlying representation of physical reality. We then identify two distinct regimes of scientific models: on inputs similar to those seen during training, high-performing models align closely while weak models diverge into local sub-optima in representation space; on structures vastly different from those seen during training, nearly all models collapse onto a low-information representation, indicating that today's models remain limited by training data and inductive bias and do not yet encode truly universal structure. Our findings establish representational alignment as a quantitative benchmark for foundation-level generality in scientific models. More broadly, our approach can be used to track the emergence of universal representations of matter as models scale, and to select and distill models whose learned representations transfer best across modalities, domains of matter, and scientific tasks.
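The "collapse onto a low-information representation" regime described in the abstract can be probed with a simple effective-rank (participation-ratio) statistic on the embedding spectrum. This is a minimal sketch of one such diagnostic, not the paper's stated procedure:

```python
import numpy as np

def effective_rank(X: np.ndarray) -> float:
    """Participation ratio of the embedding covariance spectrum.

    X: (n_structures, d) embeddings from one model. Values near 1
    indicate a collapsed, low-information representation; values
    approaching min(n_structures, d) indicate well-spread embeddings.
    """
    X = X - X.mean(axis=0, keepdims=True)
    s = np.linalg.svd(X, compute_uv=False)  # singular values
    eig = s ** 2                            # covariance eigenvalues (unscaled)
    return float(eig.sum() ** 2 / (eig ** 2).sum())
```

Comparing this statistic on in-distribution versus far-from-training probe sets would distinguish the two regimes the abstract describes.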
Problem

Research questions and friction points this paper is trying to address.

Examining representational convergence across diverse scientific machine learning models
Identifying common underlying representations of matter in high-performing models
Establishing representational alignment as a benchmark for model generality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Models across modalities show high representational alignment
Two distinct regimes identified based on input similarity to training
Representational alignment serves as a benchmark for model generality (see the sketch after this list)
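To make the benchmark claim concrete, one plausible operationalization is to test whether a model's mean alignment with the rest of the model population predicts its out-of-distribution error. The sketch below reuses `linear_cka` from the earlier snippet; `ood_errors` and the aggregation scheme are assumptions for illustration, not the paper's protocol:

```python
import numpy as np
from scipy.stats import spearmanr

def alignment_vs_generalization(embeddings: list[np.ndarray],
                                ood_errors: np.ndarray) -> float:
    """Correlate each model's mean alignment to its peers with its
    out-of-distribution error (hypothetical benchmark protocol).

    embeddings: per-model (n_structures, d_i) arrays on a shared probe set.
    ood_errors: per-model error on held-out structures (assumed available).
    """
    n = len(embeddings)
    cka = np.ones((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            cka[i, j] = cka[j, i] = linear_cka(embeddings[i], embeddings[j])
    # Mean alignment of each model to the rest of the population.
    mean_alignment = (cka.sum(axis=1) - 1.0) / (n - 1)
    rho, _ = spearmanr(mean_alignment, ood_errors)
    return float(rho)  # the reported findings suggest a negative correlation
```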
Sathya Edamadaka
Department of Materials Science and Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA.
Soojung Yang
Graduate Student, Computational & Systems Biology, MIT
computational chemistry, machine learning, drug discovery
Ju Li
Professor of Nuclear Science and Engineering and Materials Science and Engineering, MIT, USA
Computational Materials Science, metallurgy, solid mechanics, nanocomposites, batteries
Rafael Gómez-Bombarelli
Department of Materials Science and Engineering, Massachusetts Institute of Technology, 77 Massachusetts Avenue, Cambridge, MA 02139, USA.