Inferring Dynamic Physical Properties from Video Foundation Models

📅 2025-10-02
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses temporal physical perception: inferring dynamic physical properties (e.g., elasticity, viscosity, and the kinetic friction coefficient) from video. It proposes a prompt-learning-based cross-modal video understanding framework built on two pieces: (1) a new video dataset with synthetic and real splits, designed specifically for dynamic physical property estimation; and (2) learnable vision-physics prompt vectors with a cross-attention mechanism that uniformly adapt generative and self-supervised video foundation models as well as multimodal large language models (MLLMs). Experiments show that generative and self-supervised models achieve comparable performance, though both remain below the oracle upper bound obtained from classical computer-vision cues. Current MLLMs underperform the other model families, yet physics-guided prompting substantially improves their estimates. The contributions include a new benchmark dataset, a unified prompting framework, and a systematic evaluation of video-based physical property inference across diverse model families.
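A minimal sketch of such a prompt-based read-out head, assuming frozen spatio-temporal token features from a video backbone; the class name, feature dimension, and regression head are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class PromptReadout(nn.Module):
    """Learnable 'physics prompt' query that cross-attends over frozen
    video features; a linear head regresses the scalar property
    (elasticity, viscosity, or friction). Only the prompt, attention,
    and head are trained; the backbone stays frozen."""

    def __init__(self, feat_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(1, 1, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: frozen backbone features, shape (B, T*N, feat_dim)
        q = self.prompt.expand(feats.size(0), -1, -1)   # one query per video
        attended, _ = self.cross_attn(q, feats, feats)  # (B, 1, feat_dim)
        return self.head(attended.squeeze(1))           # (B, 1) property estimate

# Usage sketch: feats = frozen_backbone(video)  # hypothetical feature extractor
# pred = PromptReadout()(feats)
```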

📝 Abstract
We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: elasticity of a bouncing object, viscosity of a flowing liquid, and dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real-world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property using classical computer vision techniques; (b) a simple read-out mechanism using a visual prompt and a trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompt strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve similar performance, though behind that of the oracle, and that MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.
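To make the oracle idea concrete for the elasticity case: the coefficient of restitution can be read off from successive bounce apex heights, since apex height scales with the square of launch speed under gravity, giving e = sqrt(h_{n+1} / h_n). The sketch below is an assumed illustration, not the paper's exact pipeline; the height signal would come from a hypothetical per-frame object tracker.

```python
import numpy as np

def restitution_from_heights(heights: np.ndarray) -> float:
    """Estimate the coefficient of restitution from a per-frame height
    signal (metres) of a bouncing object tracked in the video."""
    # Bounce apexes are local maxima of the height signal
    apex = [i for i in range(1, len(heights) - 1)
            if heights[i] > heights[i - 1] and heights[i] > heights[i + 1]]
    # e = sqrt(h_{n+1} / h_n) for each pair of successive apexes
    ratios = [np.sqrt(heights[j] / heights[i]) for i, j in zip(apex, apex[1:])]
    return float(np.mean(ratios)) if ratios else float("nan")
```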
Problem

Research questions and friction points this paper is trying to address.

Predicting dynamic physical properties from video sequences
Inferring elasticity, viscosity, and kinetic friction from temporal information (a classical-cue sketch follows this list)
Evaluating video foundation models for physical property estimation
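For the friction case, a classical temporal cue follows directly from kinematics: for an object sliding on a horizontal surface, kinetic friction is the only horizontal force, so the deceleration satisfies a = mu_k * g and mu_k = |a| / g. The sketch below is an assumed illustration under those conditions; positions and timestamps would come from a tracker.

```python
import numpy as np

def kinetic_friction(positions: np.ndarray, times: np.ndarray, g: float = 9.81) -> float:
    """Estimate mu_k from tracked 1-D positions (metres) and timestamps
    (seconds) of an object sliding to rest on a horizontal surface."""
    # Fit x(t) = x0 + v0*t + 0.5*a*t^2; the quadratic coefficient is a/2
    coeffs = np.polyfit(times, positions, deg=2)
    accel = 2.0 * coeffs[0]
    return abs(accel) / g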
Innovation

Methods, ideas, or system contributions that make the work stand out.

Collect synthetic and real video datasets for physical properties
Use visual prompts with pre-trained video foundation models
Apply prompt strategies for Multi-modal Large Language Models (an illustrative prompt follows below)
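An illustrative physics-guided prompt for an MLLM; the exact wording used in the paper is not reproduced here, this only shows the general strategy of directing the model to the relevant temporal cue before asking for a numeric estimate.

```python
# Hypothetical prompt for the friction task, paired with sampled video frames
FRICTION_PROMPT = (
    "You are given frames of an object sliding on a flat surface. "
    "First, describe how quickly the object decelerates between frames. "
    "Then, using the fact that the deceleration divided by g equals the "
    "kinetic friction coefficient, estimate mu_k as a number between 0 and 1."
)
```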