🤖 AI Summary
This work addresses temporal physical perception: inferring dynamic physical properties (e.g., elasticity, viscosity, kinetic friction coefficient) from video. We propose a prompt-learning-based cross-modal video understanding framework: (1) constructing the first synthetic-plus-real video dataset explicitly designed for dynamic physical property estimation; (2) introducing learnable vision–physics prompt vectors and a cross-attention mechanism that uniformly adapt generative and self-supervised video foundation models as well as multimodal large language models (MLLMs). Experiments show that generative and self-supervised models achieve comparable performance, significantly surpassing conventional methods, but remain below the oracle upper bound. Current MLLMs underperform, yet physics-guided prompt optimization substantially improves their reasoning. Our contributions include a new benchmark dataset, a unified prompting framework, and the first systematic evaluation of video-based physical property inference across diverse model families.
📝 Abstract
We study the task of predicting dynamic physical properties from videos. More specifically, we consider physical properties that require temporal information to be inferred: the elasticity of a bouncing object, the viscosity of a flowing liquid, and the dynamic friction of an object sliding on a surface. To this end, we make the following contributions: (i) We collect a new video dataset for each physical property, consisting of synthetic training and testing splits, as well as a real split for real-world evaluation. (ii) We explore three ways to infer the physical property from videos: (a) an oracle method where we supply the visual cues that intrinsically reflect the property, using classical computer vision techniques; (b) a simple read-out mechanism using a visual prompt and a trainable prompt vector for cross-attention on pre-trained video generative and self-supervised models; and (c) prompting strategies for Multi-modal Large Language Models (MLLMs). (iii) We show that video foundation models trained in a generative or self-supervised manner achieve similar performance, though behind that of the oracle, and that MLLMs are currently inferior to the other models, though their performance can be improved through suitable prompting.
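To make the read-out mechanism in (ii)(b) concrete, here is a minimal PyTorch sketch of one plausible realization: trainable prompt vectors serve as cross-attention queries over frozen features from a pre-trained video model, and a linear head regresses the scalar physical property. All module and dimension names (`PromptReadout`, `feat_dim`, `num_prompts`) are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch (not the paper's exact model): learnable prompt
# vectors cross-attend into frozen video-model tokens; a linear head
# regresses a scalar property such as elasticity or viscosity.
import torch
import torch.nn as nn

class PromptReadout(nn.Module):
    def __init__(self, feat_dim=768, num_prompts=4, num_heads=8):
        super().__init__()
        # Trainable prompt vectors act as cross-attention queries.
        self.prompts = nn.Parameter(torch.randn(num_prompts, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads,
                                                batch_first=True)
        self.head = nn.Linear(feat_dim, 1)  # scalar property prediction

    def forward(self, video_feats):
        # video_feats: (B, N, D) spatio-temporal tokens from a frozen,
        # pre-trained generative or self-supervised video model.
        B = video_feats.shape[0]
        q = self.prompts.unsqueeze(0).expand(B, -1, -1)   # (B, P, D)
        attended, _ = self.cross_attn(q, video_feats, video_feats)
        return self.head(attended.mean(dim=1)).squeeze(-1)  # (B,)

# Usage with dummy features standing in for frozen backbone outputs:
feats = torch.randn(2, 196, 768)
model = PromptReadout()
pred = model(feats)  # shape (2,), one property estimate per video
```

Only the prompt vectors, cross-attention, and head are trained; the video backbone stays frozen, which is what makes this a lightweight probe of what the foundation model already encodes.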