PhyVLLM: Physics-Guided Video Language Model with Motion-Appearance Disentanglement

๐Ÿ“… 2025-12-04
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
Current video large language models (Video LLMs) show limited performance in understanding physical dynamics because they rely on appearance-based matching and lack explicit physical modeling. To address this, we propose a motion-appearance disentangled dual-branch architecture: an appearance branch captures static semantics, while a motion branch employs Neural Ordinary Differential Equations (Neural ODEs) to learn continuous-time physical dynamics in a self-supervised, label-free manner. Our key contribution lies in embedding physical priors directly into temporal modeling and establishing a mapping from motion-aware features to the LLM token space, thereby bridging low-level dynamics with high-level language reasoning. Extensive experiments demonstrate significant improvements over state-of-the-art methods on both physics reasoning and general video understanding benchmarks, validating that explicit physical modeling is both effective and essential for enhancing the deep reasoning capabilities of Video LLMs.

๐Ÿ“ Abstract
Video Large Language Models (Video LLMs) have shown impressive performance across a wide range of video-language tasks. However, they often fail in scenarios requiring a deeper understanding of physical dynamics, a limitation that primarily arises from their reliance on appearance-based matching. Incorporating physical motion modeling is crucial for deeper video understanding but presents three key challenges: (1) motion signals are often entangled with appearance variations, making it difficult to extract clean physical cues; (2) effective motion modeling requires not only continuous-time motion representations but also the capture of underlying physical dynamics; and (3) collecting accurate annotations for physical attributes is costly and often impractical. To address these issues, we propose PhyVLLM, a physics-guided video-language framework that explicitly incorporates physical motion into Video LLMs. Specifically, PhyVLLM disentangles visual appearance and object motion through a dual-branch encoder. To model physical dynamics over time, it incorporates a Neural Ordinary Differential Equation (Neural ODE) module that generates differentiable representations of physical dynamics. The resulting motion-aware representations are projected into the token space of a pretrained LLM, enabling physics reasoning without compromising the model's original multimodal capabilities. To circumvent the need for explicit physical labels, PhyVLLM models the continuous evolution of object motion in a self-supervised manner. Experimental results demonstrate that PhyVLLM significantly outperforms state-of-the-art Video LLMs on both physical reasoning and general video understanding tasks, highlighting the advantages of incorporating explicit physical modeling.
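The dual-branch design described in the abstract can be sketched in miniature: an appearance branch that pools static frame features, a motion branch that evolves a latent state through continuous time with an ODE integrator, a self-supervised next-state prediction signal, and a projection into an LLM token space. The sketch below is a minimal NumPy illustration only; all dimensions, the toy tanh dynamics, the RK4 integrator, and the mean-pooling choices are assumptions for demonstration, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative, not from the paper):
# D_FEAT = frame-feature dim, D_TOK = LLM token dim, T = number of frames
D_FEAT, D_TOK, T = 8, 16, 5

def ode_func(h, W):
    """Learned dynamics dh/dt = f(h); a toy tanh layer stands in for the Neural ODE's network."""
    return np.tanh(W @ h)

def rk4_step(h, W, dt):
    """One classical Runge-Kutta 4 step of the ODE integrator."""
    k1 = ode_func(h, W)
    k2 = ode_func(h + 0.5 * dt * k1, W)
    k3 = ode_func(h + 0.5 * dt * k2, W)
    k4 = ode_func(h + dt * k3, W)
    return h + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

# Frame features from a (pretend) visual backbone: T frames x D_FEAT dims
frames = rng.normal(size=(T, D_FEAT))

# Appearance branch: static semantics, here simply mean-pooled frame features
appearance = frames.mean(axis=0)

# Motion branch: evolve the first frame's latent state through continuous time
W_ode = rng.normal(scale=0.1, size=(D_FEAT, D_FEAT))
h = frames[0]
motion_states = [h]
for _ in range(T - 1):
    h = rk4_step(h, W_ode, dt=0.5)
    motion_states.append(h)
motion = np.stack(motion_states).mean(axis=0)

# Self-supervised signal (label-free): the integrated next state should
# match the observed next frame's features
pred_next = rk4_step(frames[0], W_ode, dt=0.5)
recon_loss = float(np.mean((pred_next - frames[1]) ** 2))

# Project the fused motion-appearance feature into the LLM token space
W_proj = rng.normal(scale=0.1, size=(D_TOK, 2 * D_FEAT))
token = W_proj @ np.concatenate([appearance, motion])
print(token.shape, recon_loss >= 0.0)
```

In a trained system the ODE function and projection would be learned end to end; this sketch only shows how the pieces fit together, with the prediction error serving as the label-free training signal.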
Problem

Research questions and friction points this paper is trying to address.

Video LLMs struggle with physical dynamics understanding due to appearance-based matching limitations.
Motion signals are entangled with appearance variations, hindering clean physical cue extraction.
Collecting accurate physical attribute annotations is costly and often impractical.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch encoder disentangles appearance and motion
Neural ODE module models continuous physical dynamics
Self-supervised learning avoids costly physical annotations
๐Ÿ”Ž Similar Papers
No similar papers found.
Yu-Wei Zhan
Tsinghua University
Xin Wang
Department of Computer Science and Technology, Tsinghua University
Hong Chen
Department of Computer Science and Technology, Tsinghua University
Tongtong Feng
Tsinghua University
Environment Learning · Autonomous Embodied AI · Multimedia Intelligence
Wei Feng
Department of Computer Science and Technology, Tsinghua University
Ren Wang
Department of Computer Science and Technology, Tsinghua University
Guangyao Li
Department of Computer Science and Technology, Tsinghua University
Qing Li
Department of Electronic Engineering, Tsinghua University
Wenwu Zhu
Professor, Computer Science, Tsinghua University
Multimedia Computing · Network Representation Learning