🤖 AI Summary
Current computational benchmarks for universal machine learning force fields (UMLFFs) substantially overestimate their reliability across real chemical space, particularly when predicting experimentally measured properties. Method: We introduce UniFFBench, the first experimental-data-centric evaluation framework for universal force fields, and systematically assess 12 state-of-the-art UMLFFs on crystal geometry, elastic tensors, and density predictions for ~1,500 experimentally determined mineral structures. Contribution/Results: We propose the first multi-dimensional evaluation protocol grounded in experimental measurements. Our analysis reveals a significant decoupling between simulation stability and mechanical accuracy; moreover, training-data coverage exerts a far stronger influence on prediction error than the choice of modeling methodology. Even the best-performing models exceed practical error thresholds (e.g., mean absolute error in elastic constants above 15 GPa), demonstrating that existing computational benchmarks overestimate real-world reliability.
📝 Abstract
Universal machine learning force fields (UMLFFs) promise to revolutionize materials science by enabling rapid atomistic simulations across the periodic table. However, their evaluation has been limited to computational benchmarks that may not reflect real-world performance. Here, we present UniFFBench, a comprehensive framework for evaluating UMLFFs against experimental measurements of ~1,500 carefully curated mineral structures spanning diverse chemical environments, bonding types, structural complexities, and elastic properties. Our systematic evaluation of six state-of-the-art UMLFFs reveals a substantial reality gap: models that achieve impressive performance on computational benchmarks often fail when confronted with experimental complexity. Even the best-performing models exhibit density prediction errors above the threshold required for practical applications. Most strikingly, we observe a disconnect between simulation stability and mechanical property accuracy, with prediction errors correlating with training-data representation rather than with modeling methodology. These findings demonstrate that while current computational benchmarks provide valuable controlled comparisons, they may overestimate model reliability when extrapolated to experimentally complex chemical spaces. Altogether, UniFFBench establishes essential experimental validation standards and reveals systematic limitations that must be addressed to achieve truly universal force field capabilities.
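The core of the evaluation the abstract describes is scoring model predictions against experimental measurements and checking them against practical error thresholds. A minimal sketch of the density-error portion of such a protocol is below; the mineral names, values, and the 0.1 g/cm³ threshold are illustrative placeholders, not numbers from UniFFBench.

```python
def density_mae(predicted, experimental):
    """Mean absolute error between predicted and experimental densities (g/cm^3)."""
    assert len(predicted) == len(experimental) and predicted, "need matched, non-empty lists"
    return sum(abs(p - e) for p, e in zip(predicted, experimental)) / len(predicted)

# Toy data: experimental densities for three minerals vs. a hypothetical model's predictions.
experimental = {"quartz": 2.65, "halite": 2.17, "corundum": 3.98}
predicted = {"quartz": 2.70, "halite": 2.10, "corundum": 4.10}

names = sorted(experimental)
mae = density_mae([predicted[n] for n in names], [experimental[n] for n in names])

# Compare against an illustrative practical-application threshold.
THRESHOLD = 0.1  # g/cm^3; placeholder, not the paper's actual criterion
passes = mae <= THRESHOLD
print(f"density MAE = {mae:.3f} g/cm^3, within threshold: {passes}")
```

In a real benchmark run, the predicted densities would come from relaxing each experimental structure with the force field under test, and analogous per-property MAEs (e.g., for elastic constants, in GPa) would be aggregated into the multi-dimensional score.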