🤖 AI Summary
This work addresses limitations of conventional spatiotemporal physical-system modeling, which relies on pixel-level next-frame prediction and suffers from error accumulation and high training costs, limiting its usefulness for downstream scientific tasks such as physical parameter estimation. The study proposes evaluating general self-supervised learning methods by whether they yield physically meaningful representations, using downstream task performance, particularly physical parameter estimation, as the benchmark. Comparing objectives rooted in pixel-space prediction with those operating in latent space (e.g., the Joint Embedding Predictive Architecture, JEPA), the authors show that latent-space modeling substantially improves both the physical interpretability of the learned representations and downstream task performance. Empirical results further show that some general-purpose self-supervised approaches outperform specialized physics-based models, highlighting their potential for scientific representation learning.
📝 Abstract
Machine learning approaches to spatiotemporal physical systems have primarily focused on next-frame prediction, with the goal of learning an accurate emulator for the system's evolution in time. However, these emulators are computationally expensive to train and are subject to performance pitfalls, such as compounding errors during autoregressive rollout. In this work, we take a different perspective and look at scientific tasks further downstream of predicting the next frame, such as estimation of a system's governing physical parameters. Accuracy on these tasks offers a uniquely quantifiable glimpse into the physical relevance of the representations of these models. We evaluate the effectiveness of general-purpose self-supervised methods in learning physics-grounded representations that are useful for downstream scientific tasks. Surprisingly, we find that not all methods designed for physical modeling outperform generic self-supervised learning methods on these tasks, and methods that learn in the latent space (e.g., joint embedding predictive architectures, or JEPAs) outperform those optimizing pixel-level prediction objectives. Code is available at https://github.com/helenqu/physical-representation-learning.
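The contrast the abstract draws, pixel-level prediction versus JEPA-style latent-space prediction, can be sketched as a toy comparison of the two objectives. This is an illustration only, not the paper's implementation: the linear encoder, identity predictors, and random frames below are all hypothetical stand-ins for the networks and data an actual model would use.

```python
# Toy contrast of a pixel-space next-frame objective vs. a JEPA-style
# latent-space objective. All components here are hypothetical stand-ins.
import numpy as np

rng = np.random.default_rng(0)

def pixel_loss(frame_t, frame_t1, predictor):
    """Pixel-space objective: predict the next frame directly and
    score it with per-pixel mean squared error."""
    pred = predictor(frame_t)
    return float(np.mean((pred - frame_t1) ** 2))

def jepa_loss(frame_t, frame_t1, encoder, predictor):
    """JEPA-style objective: encode both frames, then predict the
    *representation* of the next frame; the loss lives in latent space."""
    z_t, z_t1 = encoder(frame_t), encoder(frame_t1)
    pred_z = predictor(z_t)
    return float(np.mean((pred_z - z_t1) ** 2))

# Toy 8x8 "frames": the next frame is a small perturbation of the current one.
frame_t = rng.normal(size=(8, 8))
frame_t1 = frame_t + 0.1 * rng.normal(size=(8, 8))

# Hypothetical linear encoder (64 pixels -> 16-dim latent) and identity
# predictors; a real model would learn these.
W_enc = rng.normal(size=(64, 16)) / 8.0
encoder = lambda x: x.reshape(-1) @ W_enc
pixel_predictor = lambda x: x
latent_predictor = lambda z: z

print("pixel-space loss:", pixel_loss(frame_t, frame_t1, pixel_predictor))
print("latent-space loss:", jepa_loss(frame_t, frame_t1, encoder, latent_predictor))
```

The key design difference is where the error is measured: the pixel objective must account for every pixel of the future frame, while the JEPA objective only has to match a learned summary of it, which is what the paper argues yields more physics-grounded representations.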