🤖 AI Summary
This work investigates whether data distribution, not merely scale, is critical for deep learning models to acquire human-like intuitive physics reasoning. Motivated by the substantial performance gap between current large-scale models and humans on intuitive physics benchmarks (e.g., IntPhys2), we adopt a developmental psychology-inspired approach: we pretrain a lightweight V-JEPA model on SAYCam, a first-person infant-vision video dataset. Training on this dataset, which amounts to only 0.01% of the data volume used by state-of-the-art models, does not yield significant improvements on IntPhys2. Our results suggest a learning bottleneck in current architectures and challenge the prevailing hypothesis that “massive video data alone suffices to induce human-level physical intuition”; a developmentally realistic data distribution alone also appears insufficient. We argue that progress may hinge less on scaling or curating data than on rethinking model inductive biases and representational mechanisms, particularly those enabling structured, causal, and compositional reasoning about physical dynamics.
📝 Abstract
Humans expertly navigate the world by building rich internal models founded on an intuitive understanding of physics. Meanwhile, despite training on vast quantities of internet video data, state-of-the-art deep learning models still fall short of human-level performance on intuitive physics benchmarks. This work investigates whether data distribution, rather than volume, is the key to learning these principles. We pretrain a Video Joint Embedding Predictive Architecture (V-JEPA) model on SAYCam, a developmentally realistic, egocentric video dataset partially capturing three children's everyday visual experiences. We find that training on this dataset, which represents 0.01% of the data volume used to train SOTA models, does not lead to significant performance improvements on the IntPhys2 benchmark. Our results suggest that merely training on a developmentally realistic dataset is insufficient for current architectures to learn representations that support intuitive physics. We conclude that varying visual data volume and distribution alone may not be sufficient for building systems with artificial intuitive physics.