🤖 AI Summary
This work challenges the long-standing assumption that core physical knowledge, such as object permanence and shape consistency, must be innately hardwired, by investigating whether general-purpose self-supervised video pretraining can give rise to intuitive physics understanding. We study models that jointly learn an abstract representation space and predict masked video regions within it (analogous to predictive coding), so that video dynamics are modeled in latent space without pixel-level reconstruction or explicit physical modeling. Physical reasoning is evaluated with the violation-of-expectation paradigm. Models that predict in the learned representation space show an understanding of several intuitive physics properties, and even models trained on only one week of unique video perform above chance. In contrast, models that predict in pixel space and multimodal large language models, which reason through text, perform closer to chance, underscoring the role of latent-space prediction. The results suggest that intuitive physics understanding can emerge from self-supervised video prediction coupled with abstract representation learning.
📝 Abstract
We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above-chance performance. This challenges the idea that core knowledge -- a set of innate systems to help understand the world -- needs to be hardwired to develop an understanding of intuitive physics.
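To make the violation-of-expectation evaluation concrete, here is a minimal sketch of how such a test could be scored in a learned representation space. It assumes the model's predicted and observed latent vectors are already available for each video; the function names (`surprise`, `pairwise_accuracy`), the mean absolute distance used as the surprise measure, and the pairing of one physically possible video with one impossible video are all illustrative assumptions, not details given in the abstract.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def surprise(predicted: List[Vector], observed: List[Vector]) -> float:
    """Average prediction error over time steps (hypothetical surprise measure).

    Each time step contributes the mean absolute difference between the
    predicted and the observed latent vector. The distance is an assumption.
    """
    if not predicted or len(predicted) != len(observed):
        raise ValueError("need equally long, non-empty sequences")
    total = 0.0
    for p, o in zip(predicted, observed):
        total += sum(abs(a - b) for a, b in zip(p, o)) / len(p)
    return total / len(predicted)


def pairwise_accuracy(pairs: List[Tuple[float, float]]) -> float:
    """Fraction of (possible, impossible) pairs where the impossible video
    is more surprising. Chance level for this pairwise setup is 0.5."""
    if not pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for s_possible, s_impossible in pairs
                  if s_impossible > s_possible)
    return correct / len(pairs)
```

Under these assumptions, a model "understands" a property such as object permanence to the extent that `pairwise_accuracy` exceeds 0.5 on pairs that differ only in whether that property is violated.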