Opinion: Learning Intuitive Physics May Require More than Visual Data

📅 2025-12-05
🤖 AI Summary
This work investigates whether data distribution, rather than scale alone, is critical for deep learning models to acquire human-like intuitive physics. Motivated by the gap between state-of-the-art models and humans on intuitive physics benchmarks such as IntPhys2, we take a developmentally inspired approach: we pretrain a Video Joint Embedding Predictive Architecture (V-JEPA) model on SAYCam, an egocentric video dataset that partially captures three children's everyday visual experiences. SAYCam represents about 0.01% of the data volume used to train state-of-the-art models, and training on it does not lead to significant improvements on IntPhys2. The results suggest that a developmentally realistic data distribution is not, by itself, enough for current architectures to learn representations that support intuitive physics, and that varying the volume and distribution of visual data alone may not be sufficient for building systems with artificial intuitive physics.

📝 Abstract
Humans expertly navigate the world by building rich internal models founded on an intuitive understanding of physics. Meanwhile, despite training on vast quantities of internet video data, state-of-the-art deep learning models still fall short of human-level performance on intuitive physics benchmarks. This work investigates whether data distribution, rather than volume, is the key to learning these principles. We pretrain a Video Joint Embedding Predictive Architecture (V-JEPA) model on SAYCam, a developmentally realistic, egocentric video dataset partially capturing three children's everyday visual experiences. We find that training on this dataset, which represents 0.01% of the data volume used to train SOTA models, does not lead to significant performance improvements on the IntPhys2 benchmark. Our results suggest that merely training on a developmentally realistic dataset is insufficient for current architectures to learn representations that support intuitive physics. We conclude that varying visual data volume and distribution alone may not be sufficient for building systems with artificial intuitive physics.
Problem

Research questions and friction points this paper is trying to address.

Investigates whether data distribution, rather than volume, is the key to learning intuitive physics.
Tests whether pretraining on a developmentally realistic video dataset improves a model's intuitive physics performance.
Finds that developmentally realistic data alone is insufficient for current architectures to learn intuitive physics.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a developmentally realistic, egocentric video dataset (SAYCam)
Employs a Video Joint Embedding Predictive Architecture (V-JEPA)
Tests the impact of data distribution on intuitive physics learning
Ellen Su
New York University
Solim LeGris
New York University
Todd M. Gureckis
New York University
Mengye Ren
NYU
Machine Learning, Computer Vision, Artificial Intelligence