🤖 AI Summary
Existing approaches struggle to train AI models in human-like physical reasoning because annotating real-world videos is costly and synthetic data lacks realism and diversity. This work proposes a novel, scalable weakly supervised paradigm that leverages physical anomalies (glitches) in gameplay videos as training signals. The authors introduce PhysGame, a large-scale instruction-tuning dataset, and GameBench, an expert-annotated evaluation benchmark. Through a metadata-guided strategy for generating high-quality question-answer pairs, and by integrating multimodal large language models with game video analysis, the method substantially enhances models' understanding and generalization of physical laws. Experiments demonstrate consistent gains, with absolute improvements of 2.5%, 1.9%, and 3.7% on PhysBench, MVBench, and GameBench, respectively, significantly strengthening the model's ability to detect physically implausible scenarios.
📝 Abstract
Understanding the physical world, including object dynamics, material properties, and causal interactions, remains a core challenge in artificial intelligence. Although recent multi-modal large language models (MLLMs) have demonstrated impressive general reasoning capabilities, they still fall short of human-level understanding of physical principles. Existing datasets for physical reasoning either rely on real-world videos, which incur high annotation costs, or on synthetic simulations, which suffer from limited realism and diversity. In this paper, we propose a novel paradigm that leverages glitches in gameplay videos, that is, visual anomalies violating predefined physical laws, as a rich and scalable supervision source for physical world understanding. We introduce PhysGame, a metadata-guided instruction-tuning dataset containing 140,057 glitch-centric question-answer pairs across five physical domains and sixteen fine-grained categories. To ensure data accuracy, we design a prompting strategy that uses gameplay metadata, such as titles and descriptions, to guide high-quality QA generation. Complementing PhysGame, we construct GameBench, an expert-annotated benchmark of 880 glitch-identified gameplay videos designed to evaluate physical reasoning capabilities. Extensive experiments show that PhysGame significantly enhances both Game2Real transferability, improving the real-world physical reasoning performance of Qwen2.5VL by 2.5% on PhysBench, and Game2General transferability, yielding a 1.9% gain on the MVBench benchmark. Moreover, PhysGame-tuned models achieve a 3.7% absolute improvement on GameBench, demonstrating enhanced robustness in detecting physical implausibilities. These results indicate that learning from gameplay anomalies offers a scalable and effective pathway toward advancing physical world understanding in multimodal intelligence.
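To make the metadata-guided prompting strategy concrete, here is a minimal sketch of how a QA-generation prompt could be assembled from a video's title and description. All names here (`build_qa_prompt`, `PHYSICAL_DOMAINS`, the example metadata) are illustrative assumptions, not the paper's actual prompt or domain taxonomy.

```python
# Illustrative sketch only: the paper describes guiding QA generation with
# gameplay metadata; the exact prompt wording and domain list are assumptions.

PHYSICAL_DOMAINS = [
    "gravity", "rigid-body collision", "fluid dynamics",
    "object permanence", "human motion",
]

def build_qa_prompt(title: str, description: str) -> str:
    """Compose an instruction prompt that grounds glitch-centric QA
    generation in the video's title and uploader description."""
    domain_list = ", ".join(PHYSICAL_DOMAINS)
    return (
        "You are annotating a gameplay video that contains a physics glitch.\n"
        f"Video title: {title}\n"
        f"Uploader description: {description}\n"
        f"Candidate physical domains: {domain_list}.\n"
        "Generate one multiple-choice question asking which physical law "
        "the glitch violates, with four options and the correct answer."
    )

prompt = build_qa_prompt(
    title="Car clips through bridge and falls forever",
    description="Vehicle glitch: the car ignores the bridge collider.",
)
print(prompt)
```

Conditioning the generator on metadata in this way is what lets an MLLM produce questions anchored to the specific glitch rather than generic video content, which the abstract credits for the dataset's accuracy.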