🤖 AI Summary
This work addresses a critical limitation in how physics foundation models are evaluated: they are often assessed only on average metrics, which cannot distinguish a model that has learned transferable physical dynamics from one that merely fits its training distribution. To this end, the authors introduce a bias-aware benchmark covering eight physical dynamics, three training-data mixing strategies, and 25 test regimes, enabling systematic evaluation of generalization under in-distribution, distribution-shift, and out-of-distribution settings. Large-scale experiments across five architectures, each with four variants (trained from scratch and three pretrained sizes), yield 60,000 measurements and show that physics foundation models act as conditional rather than universal generalists. Their performance is highly sensitive to the physical regime, temporal scale, initial conditions, pretraining strategy, model scale, and architecture, indicating that simply scaling up data or model size does not reliably remove these generalization biases.
📝 Abstract
Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a single average score under a fixed training distribution. This makes it difficult to determine whether a model has learned generalizable physical dynamics or only performs well under particular settings. We construct a benchmark with 8 physical dynamics, 3 training-data mixtures, and 25 test regimes induced by shifts in dynamic scale and initial-condition complexity, covering in-distribution, distribution-shift, and out-of-distribution settings. We evaluate five physics foundation model architectures with four model variants per architecture (one trained from scratch and three pretrained sizes), resulting in 60,000 measurements. Our results show that current physics foundation models behave as conditional rather than universal generalists: their generality depends on the physical regime, temporal scale, initial-condition setting, pretraining, model size, and architecture. Improving the training data distribution only partially mitigates this limitation, and pretraining and scaling likewise fail to reliably remove these capability biases. We argue that improving physics foundation models requires moving beyond scaling models or expanding data, toward learning mechanisms that better capture transferable physical knowledge across regimes, temporal scales, and distribution shifts.
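To make the concern about single average scores concrete, the following is a minimal sketch in Python. The model names and error values are hypothetical and chosen only for illustration; they are not results from the paper, and the helper `summarize` is an assumption introduced here, not part of the benchmark. The sketch shows how two models can share the same average error while differing sharply in their worst regime.

```python
from statistics import mean

# Hypothetical per-regime errors (lower is better) for two made-up models.
# These values are illustrative only and are NOT results from the paper.
errors = {
    "model_a": {
        "in_distribution": 0.10,
        "distribution_shift": 0.20,
        "out_of_distribution": 0.30,
    },
    "model_b": {
        "in_distribution": 0.02,
        "distribution_shift": 0.08,
        "out_of_distribution": 0.50,
    },
}


def summarize(per_regime):
    """Return the average error plus the worst regime and its error."""
    worst_regime = max(per_regime, key=per_regime.get)
    return {
        "average": mean(per_regime.values()),
        "worst_regime": worst_regime,
        "worst_error": per_regime[worst_regime],
    }


summaries = {name: summarize(scores) for name, scores in errors.items()}

for name, s in summaries.items():
    print(
        f"{name}: average={s['average']:.2f}, "
        f"worst={s['worst_error']:.2f} ({s['worst_regime']})"
    )
```

In this toy setup both models report the same average, yet the second degrades much more in the out-of-distribution regime. A per-regime breakdown of the kind the benchmark provides is what exposes that difference.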