Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to physical law discovery predominantly rely on unimodal data and neglect visual motion representations, limiting their capacity to model spatiotemporal patterns in dynamic phenomena. This work proposes VIPER-R1, the first framework to integrate vision-language models (VLMs) into symbolic physical formula discovery. VIPER-R1 jointly processes video observations, trajectory data, and symbolic reasoning to emulate the scientific workflow of observation–hypothesis–validation. Its key innovations include: (i) a causal chain-of-thought mechanism to guide physically grounded hypothesis generation; (ii) a symbolic residual realignment module for perturbation-sensitive symbolic calibration; and (iii) synergistic learning via motion-structure-guided curriculum training, reinforcement-learning-driven symbolic optimization, and collaborative inference with external symbolic regression tools. On the PhysSymbol benchmark, VIPER-R1 achieves state-of-the-art performance—significantly outperforming VLM baselines—in both formula accuracy and physical interpretability.

📝 Abstract
Automated discovery of physical laws from observational data in the real world is a grand challenge in AI. Current methods, relying on symbolic regression or LLMs, are limited to uni-modal data and overlook the rich, visual phenomenological representations of motion that are indispensable to physicists. This "sensory deprivation" severely weakens their ability to interpret the inherent spatio-temporal patterns within dynamic phenomena. To address this gap, we propose VIPER-R1, a multimodal model that performs Visual Induction for Physics-based Equation Reasoning to discover fundamental symbolic formulas. It integrates visual perception, trajectory data, and symbolic reasoning to emulate the scientific discovery process. The model is trained via a curriculum of Motion Structure Induction (MSI), using supervised fine-tuning to interpret kinematic phase portraits and to construct hypotheses guided by a Causal Chain of Thought (C-CoT), followed by Reward-Guided Symbolic Calibration (RGSC) to refine the formula structure with reinforcement learning. During inference, the trained VIPER-R1 acts as an agent: it first posits a high-confidence symbolic ansatz, then proactively invokes an external symbolic regression tool to perform Symbolic Residual Realignment (SR^2). This final step, analogous to a physicist's perturbation analysis, reconciles the theoretical model with empirical data. To support this research, we introduce PhysSymbol, a new 5,000-instance multimodal corpus. Experiments show that VIPER-R1 consistently outperforms state-of-the-art VLM baselines in accuracy and interpretability, enabling more precise discovery of physical laws. Project page: https://jiaaqiliu.github.io/VIPER-R1/
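The Symbolic Residual Realignment (SR^2) step described in the abstract can be pictured as follows: the trained model posits a symbolic ansatz, and an external symbolic regression tool then fits the residual between that ansatz and the observed trajectory, analogous to perturbation analysis. The sketch below is a simplified illustration, not the paper's implementation: plain least squares over a small hand-picked candidate basis stands in for the external symbolic regression tool, and all function names are hypothetical.

```python
import numpy as np

def sr2_realign(t, y, ansatz, basis):
    """Fit the residual y - ansatz(t) with a linear combination of
    candidate basis terms; return the fitted coefficients.
    (Hypothetical stand-in for the paper's SR^2 step.)"""
    residual = y - ansatz(t)
    design = np.column_stack([f(t) for f in basis])
    coeffs, *_ = np.linalg.lstsq(design, residual, rcond=None)
    return coeffs

# Toy example: true law y = 0.5*t^2 + 0.1*sin(t); the model's ansatz
# captures only the quadratic part, and the residual fit recovers the
# small sin(t) correction term.
t = np.linspace(0.0, 10.0, 200)
y = 0.5 * t**2 + 0.1 * np.sin(t)
ansatz = lambda t: 0.5 * t**2
basis = [np.sin, np.cos, lambda t: t]
coeffs = sr2_realign(t, y, ansatz, basis)
print(np.round(coeffs, 3))  # weight on sin(t) should be close to 0.1
```

In the paper the candidate space is searched by a symbolic regression tool rather than fixed in advance; the point of the sketch is only the residual-fitting structure of the step.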
Problem

Research questions and friction points this paper is trying to address.

Automated discovery of physical laws from observational data
Overcoming limitations of uni-modal methods by integrating visual data
Enabling precise interpretation of spatio-temporal patterns in dynamic phenomena
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates visual perception, trajectory data, and symbolic reasoning
Uses supervised fine-tuning with Causal Chain of Thought
Employs reinforcement learning for reward-guided symbolic calibration
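The reward-guided symbolic calibration above scores candidate formulas by structure rather than by raw numeric fit. A minimal sketch of one plausible structure reward, assuming term-set overlap as the scoring rule (the paper's exact reward is not specified here, and the term strings are hypothetical):

```python
def structure_reward(pred_terms, true_terms):
    """Jaccard overlap between the candidate formula's set of symbolic
    terms and a reference skeleton, ignoring coefficients.
    (Illustrative only; not the paper's exact reward.)"""
    pred, true = set(pred_terms), set(true_terms)
    if not pred and not true:
        return 1.0
    return len(pred & true) / len(pred | true)

# Candidate 'a*t**2 + b*sin(t)' vs. reference 'c*t**2 + d*t':
# one shared term out of three distinct terms.
print(structure_reward({"t**2", "sin(t)"}, {"t**2", "t"}))  # 1/3
```

A structure-level reward of this kind lets reinforcement learning refine the formula's skeleton before any coefficients are calibrated against data.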