QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models

📅 2025-12-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates whether vision-language models (VLMs) can perform *quantitative* physical reasoning, i.e., estimating kinematic attributes (e.g., size, velocity, acceleration) of moving objects from video, rather than only making qualitative physical judgments. To this end, we introduce **QuantiPhy**, the first benchmark for quantitative physical reasoning, comprising over 3.3K video-text samples with precise numerical ground truth. We formulate a prior-guided numerical regression task and design a unified prompting template together with numerical evaluation metrics (MAE, RMSE). Key contributions include: (1) the first standardized evaluation protocol for physical *magnitudes*, not just categories or directions; (2) empirical evidence that state-of-the-art VLMs rely heavily on linguistic priors rather than on the provided visual and textual signals for quantitative estimation; and (3) a demonstration that their quantitative accuracy is weak (“qualitatively plausible” outputs are not necessarily “numerically correct”) and highly sensitive to background noise, counterfactual priors, and prompt engineering, underscoring the urgent need for numerically grounded multimodal understanding.
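
To make the numerical scoring concrete, here is a minimal sketch of how MAE and RMSE could be computed over predicted versus ground-truth kinematic values; the function name and sample values are illustrative assumptions, not taken from the QuantiPhy release.

```python
import math

def mae_rmse(predictions, ground_truth):
    """Compute MAE and RMSE between predicted and ground-truth kinematic values.

    `predictions` and `ground_truth` are equal-length lists of floats,
    e.g. estimated object velocities in m/s for each benchmark sample.
    (Hypothetical helper; not part of the QuantiPhy codebase.)
    """
    assert predictions and len(predictions) == len(ground_truth)
    errors = [p - g for p, g in zip(predictions, ground_truth)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse

# Example: a model's velocity estimates vs. annotated ground truth (m/s).
preds = [2.1, 0.8, 5.6]
truth = [2.0, 1.2, 4.9]
print(mae_rmse(preds, truth))  # -> (0.4, ~0.47)
```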

📝 Abstract
Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason about physical properties quantitatively. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors such as background noise, counterfactual priors, and strategic prompting, and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs as references when reasoning about kinematic properties quantitatively. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.
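
The prior-guided setup described above, where one kinematic property is supplied as a numerical prior and another is queried at a given timestamp, could be phrased roughly as in the sketch below; the function name and prompt wording are hypothetical and are not QuantiPhy's actual standardized template.

```python
def build_prior_guided_prompt(prior_name, prior_value, prior_unit,
                              target_name, target_unit, timestamp_s):
    """Assemble a hypothetical prior-guided query for a video-text sample.

    One kinematic property (e.g. size) is given as a numerical prior, and
    the model is asked to estimate another (e.g. velocity) at a specific
    timestamp. Wording is illustrative, not QuantiPhy's template.
    """
    return (
        f"The object's {prior_name} is {prior_value} {prior_unit}. "
        f"Based on the video, estimate its {target_name} at "
        f"t = {timestamp_s} s. Answer with a single number in {target_unit}."
    )

# Example usage: size is the prior, velocity is the queried quantity.
print(build_prior_guided_prompt("height", 1.8, "m", "velocity", "m/s", 2.5))
```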
Problem

Research questions and friction points this paper is trying to address.

Evaluates VLMs' quantitative physical reasoning from videos
Measures accuracy in estimating object size, velocity, acceleration
Assesses reliance on world knowledge versus visual-textual inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

QuantiPhy benchmark quantitatively measures VLM physical reasoning
Standardized prompts and scoring assess numerical accuracy of kinematic properties
Evaluates VLMs using video-text instances with numerical ground truth
🔎 Similar Papers
No similar papers found.