🤖 AI Summary
This work addresses the challenge of inferring implicit physical properties that cannot be seen directly, such as mass and charge, from video, and of using them for dynamics prediction and physical causal reasoning. Existing methods struggle to disentangle compositional implicit attributes. To this end, the authors introduce ComPhy, the first synthetic dataset enabling fine-grained, disentangled physical attribute reasoning. They further propose PCR, a neuro-symbolic framework integrating video understanding, multi-object tracking, physics-simulation-driven learning, and multi-step causal question answering. PCR enables cross-frame object association, joint modeling of explicit and implicit attributes, and both future and counterfactual dynamics prediction. Experiments demonstrate that PCR significantly outperforms state-of-the-art methods on both synthetic and real-world videos, with gains in implicit property inference, dynamics prediction accuracy, and complex physical reasoning.
📝 Abstract
Understanding and reasoning about objects' physical properties in the natural world is a fundamental challenge in artificial intelligence. While some properties, such as color and shape, can be directly observed, others, such as mass and electric charge, cannot be read from an object's visual appearance. This paper addresses the challenge of inferring these hidden physical properties from objects' motion and interactions and of predicting the corresponding dynamics based on the inferred properties. We first introduce the Compositional Physical Reasoning (ComPhy) dataset. For a given set of objects, ComPhy includes a limited number of videos of them moving and interacting under different initial conditions. A model is evaluated on its ability to unravel the compositional hidden properties, such as mass and charge, and to use this knowledge to answer a set of questions. Besides the synthetic videos from simulators, we also collect a real-world dataset to further test the physical reasoning abilities of different models. We evaluate state-of-the-art video reasoning models on ComPhy and reveal their limited ability to capture these hidden properties, which leads to inferior performance. We also propose a novel neuro-symbolic framework, Physical Concept Reasoner (PCR), that learns and reasons about both visible and hidden physical properties from question answering. After training, PCR can detect and associate objects across frames, ground visible and hidden physical properties, make future and counterfactual predictions, and use these extracted representations to answer challenging questions.
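To make the idea of reading a hidden property off motion concrete, here is a minimal sketch. It is not the PCR method, which the abstract describes only at a high level. It assumes an isolated two-body collision with known velocities before and after impact, and uses momentum conservation, m1 (v1 - v1') = m2 (v2' - v2), to estimate the mass ratio. The function name `mass_ratio_from_collision` and its arguments are hypothetical, chosen for illustration.

```python
def mass_ratio_from_collision(v1_before, v1_after, v2_before, v2_after):
    """Estimate m2 / m1 from one two-body collision (illustrative assumption,
    not the paper's model).

    Momentum conservation gives m1 * dv1 = m2 * dv2, where
    dv1 = v1_before - v1_after and dv2 = v2_after - v2_before.
    The ratio r = m2 / m1 is the least-squares solution of dv1 = r * dv2.
    """
    dv1 = [b - a for b, a in zip(v1_before, v1_after)]
    dv2 = [a - b for b, a in zip(v2_before, v2_after)]
    denom = sum(x * x for x in dv2)
    if denom == 0.0:
        # Object 2 did not change velocity, so the collision carries no
        # information about the mass ratio.
        raise ValueError("object 2 velocity is unchanged; ratio is undetermined")
    return sum(x * y for x, y in zip(dv1, dv2)) / denom


# Elastic head-on collision with m1 = 1 and m2 = 2: object 1 arrives at 3 m/s,
# object 2 is at rest; afterwards they move at -1 m/s and 2 m/s respectively.
ratio = mass_ratio_from_collision([3.0, 0.0], [-1.0, 0.0], [0.0, 0.0], [2.0, 0.0])
print(ratio)  # 2.0
```

A single collision only fixes the ratio of the two masses, not their absolute values, and charge would need a separate cue such as attraction or repulsion at a distance. This is one way to see why several videos of the same objects under different initial conditions are useful.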