MLIP Arena: Advancing Fairness and Transparency in Machine Learning Interatomic Potentials via an Open, Accessible Benchmark Platform

📅 2025-09-24
🤖 AI Summary
Existing MLIP benchmarks suffer from data leakage, poor transferability, and overreliance on error metrics tied to a single DFT functional, compromising evaluation fairness and physical consistency. To address these issues, the paper proposes an application-oriented, multidimensional evaluation framework grounded in physical principles: it introduces validation tasks targeting chemical reactivity, stability under extreme conditions, and thermodynamic prediction; incorporates cross-system transferability testing; and employs a dynamic, functional-agnostic metric suite. Complementing the framework, the authors release an open-source Python toolkit and an online leaderboard to ensure reproducibility and transparency. Systematic evaluation of state-of-the-art MLIPs uncovers critical failure modes, such as breakdown under thermal excitation or chemical transformation, and establishes a robust, efficient, and physically self-consistent benchmark standard. This advances the accuracy–efficiency trade-off in MLIP development and provides actionable guidance for next-generation model design.

📝 Abstract
Machine learning interatomic potentials (MLIPs) have revolutionized molecular and materials modeling, but existing benchmarks suffer from data leakage, limited transferability, and an over-reliance on error-based metrics tied to specific density functional theory (DFT) references. We introduce MLIP Arena, a benchmark platform that evaluates force field performance based on physics awareness, chemical reactivity, stability under extreme conditions, and predictive capabilities for thermodynamic properties and physical phenomena. By moving beyond static DFT references and revealing the important failure modes of current foundation MLIPs in real-world settings, MLIP Arena provides a reproducible framework to guide the next-generation MLIP development toward improved predictive accuracy and runtime efficiency while maintaining physical consistency. The Python package and online leaderboard are available at https://github.com/atomind-ai/mlip-arena.
Problem

Research questions and friction points this paper is trying to address.

Addressing data leakage and limited transferability in MLIP benchmarks
Overcoming over-reliance on error metrics tied to DFT references
Revealing failure modes of foundation MLIPs in real-world applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates force fields using physics-aware metrics
Moves beyond static DFT references for assessment
Provides reproducible framework for MLIP development
Authors

Yuan Chiang (UC Berkeley, Lawrence Berkeley National Laboratory)
Tobias Kreiman (UC Berkeley)
Christine Zhang (UC Berkeley)
Matthew C. Kuner (UC Berkeley, LBNL)
Elizabeth Weaver (UC Berkeley)
Ishan Amin (UC Berkeley)
Hyunsoo Park (NCSOFT, Game AI Lab)
Yunsung Lim (KAIST)
Jihan Kim (KAIST)
Daryl Chrzan (UC Berkeley, LBNL)
Aron Walsh (Department of Materials, Imperial College London)
Samuel M. Blau (LBNL)
Mark Asta (UC Berkeley, LBNL)
Aditi S. Krishnapriyan (UC Berkeley, LBNL)