IPR-1: Interactive Physical Reasoner

📅 2025-11-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether autonomous agents can acquire human-like physical and causal reasoning capabilities solely through environmental interaction. Method: We propose an interactive physical reasoning framework featuring PhysCode—a physics-aware action encoding space that unifies semantic intent with dynamical behavior—and integrate a vision-language model (VLM) policy, world-model-based forward rollout prediction, and policy-gradient reinforcement learning. The framework undergoes large-scale pretraining across 1,000+ heterogeneous games. Contribution/Results: Evaluated across survival, curiosity-driven, and utility-oriented benchmarks, the model demonstrates robust performance on diverse human-like physical reasoning tasks—matching GPT-5’s overall capability while significantly outperforming it on curiosity-driven tasks. Performance consistently improves with increasing interaction steps and game complexity, and the model exhibits strong zero-shot transferability to unseen environments.

📝 Abstract
Humans learn by observing, interacting with environments, and internalizing physics and causality. Here, we ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with more experience. We study this in a Game-to-Unseen (G2U) setting, curating 1,000+ heterogeneous games with diverse physical and causal mechanisms, and evaluate at three human-like levels: Survival, Curiosity, and Utility, ranging from primitive intuition to goal-driven reasoning. Our analysis reveals complementary failures: VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyze physics and causality. We therefore propose IPR (Interactive Physical Reasoner), which uses world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code that aligns semantic intent with dynamics to provide a shared action space for prediction and reasoning. Pretrained on 1,000+ games, IPR performs robustly across all three levels, matches GPT-5 overall, and surpasses it on Curiosity. We find that performance improves with more training games and interaction steps, and that the model also zero-shot transfers to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning.
Problem

Research questions and friction points this paper is trying to address.

Agents lack human-like physical reasoning from environmental interaction
World models imitate visuals rather than analyzing physics and causality
Need shared action space aligning semantic intent with physical dynamics
Innovation

Methods, ideas, or system contributions that make the work stand out.

World-model rollouts score and reinforce VLM policy
PhysCode aligns semantic intent with dynamics
Pretrained on 1000+ games for robust reasoning
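The core loop described above (imagine candidate actions with a world model, score the rollouts, and reinforce the policy toward high-scoring actions) can be illustrated with a deliberately simplified sketch. Everything here is a hypothetical stand-in: the toy action set replaces PhysCode tokens, the hard-coded reward table replaces a learned world model, and a softmax bandit replaces the VLM policy. It is meant only to show the rollout-score-reinforce pattern, not the authors' implementation.

```python
import math
import random

# Stand-in for PhysCode action tokens (hypothetical names).
ACTIONS = ["push", "lift", "wait"]

def world_model_rollout(action, horizon=5):
    """Toy forward model: returns an imagined cumulative return for an
    action. A real world model would roll latent dynamics forward."""
    per_step = {"push": 1.0, "lift": 0.5, "wait": 0.0}[action]
    return per_step * horizon

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reinforce_step(logits, lr=0.5):
    """One policy-gradient step: sample an action from the policy, score
    it with a world-model rollout, and push the logits toward actions
    whose imagined return beats the policy's expected return."""
    probs = softmax(logits)
    i = random.choices(range(len(ACTIONS)), weights=probs)[0]
    ret = world_model_rollout(ACTIONS[i])
    baseline = sum(p * world_model_rollout(a) for p, a in zip(probs, ACTIONS))
    adv = ret - baseline                               # advantage estimate
    for j in range(len(logits)):
        grad = (1.0 if j == i else 0.0) - probs[j]     # d log pi / d logit_j
        logits[j] += lr * adv * grad
    return logits

random.seed(0)
logits = [0.0, 0.0, 0.0]
for _ in range(200):
    logits = reinforce_step(logits)

# The policy concentrates on the action with the best imagined return.
print([round(p, 3) for p in softmax(logits)])
```

In the paper's setting the "score" comes from forward rollouts in a learned world model rather than a known reward table, so the same pattern lets the VLM policy improve from imagined interaction without hand-labeled supervision.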
Mingyu Zhang, Shanghai Jiao Tong University
Lifeng Zhuo, Shanghai Jiao Tong University
Tianxi Tan, Shanghai Jiao Tong University
Guocan Xie, Shanghai Jiao Tong University
Xian Nie, Shanghai Jiao Tong University
Yan Li, Shanghai Jiao Tong University
Renjie Zhao, Shanghai Jiao Tong University
Zizhu He, Shanghai Jiao Tong University
Ziyu Wang, Shanghai Jiao Tong University
Jiting Cai, Carnegie Mellon University
Yong-Lu Li, Associate Professor, Shanghai Jiao Tong University/Shanghai Innovation Institute
Physical Reasoning · Robotics · Computer Vision · Machine Learning · Embodied AI