🤖 AI Summary
Existing MLIP benchmarks suffer from data leakage, poor transferability, and overreliance on error metrics tied to a single DFT functional, which compromises evaluation fairness and physical consistency. To address these issues, we propose an application-oriented, multidimensional evaluation framework grounded in physical principles: it introduces validation tasks targeting chemical reactivity, extreme-condition stability, and thermodynamic prediction; incorporates cross-system transferability testing; and employs a metric suite that does not depend on any particular DFT functional. Complementing the framework, we release an open-source Python toolkit and an online leaderboard to support reproducibility and transparency. Systematic evaluation of state-of-the-art MLIPs uncovers critical failure modes, such as breakdown under thermal excitation or chemical transformation, and provides a reproducible, physically consistent basis for comparing models. The results offer actionable guidance for next-generation model design, in particular for balancing predictive accuracy against runtime efficiency.
📝 Abstract
Machine learning interatomic potentials (MLIPs) have revolutionized molecular and materials modeling, but existing benchmarks suffer from data leakage, limited transferability, and an over-reliance on error-based metrics tied to specific density functional theory (DFT) references. We introduce MLIP Arena, a benchmark platform that evaluates force field performance based on physics awareness, chemical reactivity, stability under extreme conditions, and predictive capabilities for thermodynamic properties and physical phenomena. By moving beyond static DFT references and revealing important failure modes of current foundation MLIPs in real-world settings, MLIP Arena provides a reproducible framework to guide next-generation MLIP development toward improved predictive accuracy and runtime efficiency while maintaining physical consistency. The Python package and online leaderboard are available at https://github.com/atomind-ai/mlip-arena.
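To make the idea of an evaluation that needs no DFT reference concrete, here is a minimal sketch of one physics-based check: whether a model's forces equal the negative gradient of its energies. This is an illustration only, not the MLIP Arena implementation or API. The Lennard-Jones functions are a toy stand-in for an MLIP, and the name `force_consistency_error` is hypothetical.

```python
def lj_energy(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair energy; a toy stand-in for an MLIP energy prediction."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def lj_force(r, epsilon=1.0, sigma=1.0):
    """Analytic radial force F = -dE/dr for the same toy potential."""
    sr6 = (sigma / r) ** 6
    return 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) / r


def force_consistency_error(energy_fn, force_fn, distances, h=1e-5):
    """Largest |F(r) + dE/dr| over a distance scan (hypothetical helper).

    dE/dr is estimated with central finite differences, so no external
    reference data (DFT or otherwise) is needed.
    """
    worst = 0.0
    for r in distances:
        dE_dr = (energy_fn(r + h) - energy_fn(r - h)) / (2.0 * h)
        worst = max(worst, abs(force_fn(r) + dE_dr))
    return worst


# Scan pair separations from 0.9 to 2.85 in reduced Lennard-Jones units.
distances = [0.9 + 0.05 * i for i in range(40)]
max_error = force_consistency_error(lj_energy, lj_force, distances)
print(f"max |F + dE/dr| over scan: {max_error:.2e}")
```

For a conservative model the reported value stays near the finite-difference noise floor. A model whose forces are not the derivative of its energy would show a much larger value, and the test compares the model only against itself rather than against a chosen DFT functional.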