🤖 AI Summary
This work investigates whether autonomous agents can acquire human-like physical and causal reasoning solely through environmental interaction. Method: We propose an interactive physical reasoning framework built on PhysCode, a physics-aware action encoding space that unifies semantic intent with dynamical behavior, and combine a vision-language model (VLM) policy, world-model-based forward rollout prediction, and policy-gradient reinforcement learning. The framework is pretrained at scale on 1,000+ heterogeneous games. Contribution/Results: Evaluated on survival, curiosity-driven, and utility-oriented benchmarks, the model performs robustly across diverse human-like physical reasoning tasks, matching GPT-5's overall capability while significantly outperforming it on curiosity-driven tasks. Performance consistently improves with more interaction steps and more training games, and the model transfers zero-shot to unseen environments.
📝 Abstract
Humans learn by observing, interacting with environments, and internalizing physics and causality. We ask whether an agent can similarly acquire human-like reasoning from interaction and keep improving with experience. We study this in a Game-to-Unseen (G2U) setting, curating 1,000+ heterogeneous games with diverse physical and causal mechanisms, and evaluate agents at three human-like levels (Survival, Curiosity, Utility), ranging from primitive intuition to goal-driven reasoning. Our analysis reveals complementary failures: VLM/VLA agents reason but lack look-ahead in interactive settings, while world models imagine but imitate visual patterns rather than analyzing physics and causality. We therefore propose IPR (Interactive Physical Reasoner), which uses world-model rollouts to score and reinforce a VLM's policy, and introduce PhysCode, a physics-centric action code that aligns semantic intent with dynamics, providing a shared action space for prediction and reasoning. Pretrained on 1,000+ games, IPR performs robustly at all three levels, matches GPT-5 overall, and surpasses it on Curiosity. Performance improves with more training games and interaction steps, and the model transfers zero-shot to unseen games. These results support physics-centric interaction as a path to steadily improving physical reasoning.
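The training loop the abstract describes (world-model rollouts scoring actions, which then reinforce the policy via policy gradients) can be illustrated with a minimal sketch. Everything below is hypothetical: the toy action vocabulary standing in for PhysCode, the stub `world_model_rollout` reward model, and the tabular softmax policy standing in for the VLM are all illustrative assumptions, not the paper's implementation.

```python
import math
import random

random.seed(0)

# Toy PhysCode-like discrete action codes (illustrative, not the real vocabulary).
ACTIONS = ["push", "lift", "wait"]
logits = {a: 0.0 for a in ACTIONS}  # tabular softmax policy, stand-in for the VLM


def policy_probs():
    """Softmax over per-action logits."""
    z = {a: math.exp(l) for a, l in logits.items()}
    s = sum(z.values())
    return {a: v / s for a, v in z.items()}


def world_model_rollout(state, action, horizon=3):
    """Stub world model: returns a scalar return for an imagined rollout.
    Here 'push' happens to be best -- purely for illustration."""
    reward = {"push": 1.0, "lift": 0.3, "wait": 0.0}[action]
    return reward * horizon


def reinforce_step(state, lr=0.5):
    """Sample an action from the policy, score it with a world-model rollout,
    and apply a REINFORCE update: grad log pi(a) * (return - baseline)."""
    probs = policy_probs()
    action = random.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]
    ret = world_model_rollout(state, action)
    # Expected rollout return under the current policy as a variance-reducing baseline.
    baseline = sum(probs[a] * world_model_rollout(state, a) for a in ACTIONS)
    advantage = ret - baseline
    for a in ACTIONS:
        grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
        logits[a] += lr * advantage * grad_log_pi


for _ in range(200):
    reinforce_step(state=0)

print(max(policy_probs(), key=policy_probs().get))  # the policy concentrates on "push"
```

The key design point mirrored here is that the world model never updates the policy directly: it only supplies rollout returns, and the policy-gradient step does the reinforcement, so prediction and reasoning interact solely through the shared action space.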