🤖 AI Summary
To address the challenges of poor physical interpretability and scarce supervision signals in visual trajectory prediction, this paper proposes a dynamics-driven latent-space world model. Built on a variational autoencoder framework, the method integrates a parameterized physical dynamics model and introduces, for the first time, an interval-constrained weak-supervision mechanism that requires no ground-truth physical annotations, explicitly aligning latent variables with unobserved physical state quantities (e.g., position and velocity). Compared to conventional end-to-end approaches, the model substantially improves both physical consistency and trajectory prediction accuracy. Extensive evaluations on multiple visual dynamics benchmarks demonstrate simultaneous gains in interpretability and generalization. This work establishes a novel paradigm for physics-guided representation learning.
📝 Abstract
Deep learning models are increasingly employed for perception, prediction, and control in complex systems. Embedding physical knowledge into these models is crucial for achieving realistic and consistent outputs, a challenge often addressed by physics-informed machine learning. However, integrating physical knowledge with representation learning becomes difficult when dealing with high-dimensional observation data, such as images, particularly under conditions of incomplete or imprecise state information. To address this, we propose Physically Interpretable World Models, a novel architecture that aligns learned latent representations with real-world physical quantities. Our method combines a variational autoencoder with a dynamical model that incorporates unknown system parameters, enabling the discovery of physically meaningful representations. By employing weak supervision with interval-based constraints, our approach eliminates the reliance on ground-truth physical annotations. Experimental results demonstrate that our method improves the quality of learned representations while achieving accurate predictions of future states, advancing the field of representation learning in dynamic systems.
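As a rough illustration of the interval-based weak supervision described in the abstract, one could add a penalty term that is zero when a latent coordinate lies inside a known physical interval (e.g., a plausible range for position or velocity) and grows linearly outside it. This is only a sketch of the general idea; the function name and the hinge-style penalty form are assumptions, not the paper's actual implementation.

```python
import numpy as np

def interval_loss(z, lo, hi):
    """Hypothetical interval-constraint penalty for latent coordinates.

    Zero when each z[i] lies inside [lo[i], hi[i]]; increases linearly
    with the distance to the nearest interval bound otherwise. Such a
    term could be added to a VAE objective to weakly supervise latent
    variables toward physically plausible ranges without ground-truth
    annotations (an illustrative stand-in, not the paper's method).
    """
    z, lo, hi = map(np.asarray, (z, lo, hi))
    below = np.maximum(lo - z, 0.0)  # violation amount when z < lo
    above = np.maximum(z - hi, 0.0)  # violation amount when z > hi
    return float(np.sum(below + above))
```

For example, a latent vector `[0.5, 2.0]` constrained to the unit box `[0, 1]^2` incurs a penalty of 1.0, contributed entirely by the second coordinate exceeding its upper bound.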