$Δ$ynamics: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos

📅 2026-05-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods for inferring rigid-body physical states from monocular videos generalize poorly because they assume a specific physical system, object category, or camera configuration. This work proposes a vision-language framework that uses structured text as a unified representation of rigid-body dynamics: rather than predicting parameters directly, the model generates a scene configuration for a physics simulator. Generalization is improved by adding natural-language motion reasoning and by taking optical flow, a semantic-agnostic signal, as input, which decouples dynamics modeling from visual semantics. On the CLEVRER benchmark, the method achieves a segmentation IoU of 0.30, a 7x improvement over leading vision-language models, and test-time sampling and evolutionary search raise segmentation IoU by a further 27% and 120%, respectively. The method also transfers to a new dataset of 235 real-world rigid-body videos.

📝 Abstract

Inferring rigid-body physical states and properties from monocular videos is a fundamental step toward physics-based perception and simulation. Existing approaches assume specific underlying physical systems, object types, and camera poses, making them unable to generalize to complex real-world settings. We introduce $Δ$YNAMICS, a vision-language framework that uses language as a unified representation of rigid-body dynamics. Instead of directly predicting parameters, $Δ$YNAMICS generates scene configurations in a structured text format for physics simulation. We enhance the model's generalization by integrating natural language motion reasoning and leveraging optical flow as a semantic-agnostic input. On the CLEVRER dataset, $Δ$YNAMICS achieves a segmentation IoU of 0.30, a 7x improvement over leading VLMs (InternVL3-8B, Qwen2.5-VL-7B and Claude-4-Sonnet). Additionally, test-time sampling and evolutionary search further boost performance by 27% and 120% in segmentation IoU, respectively. Finally, we demonstrate strong transfer to a new dataset of 235 real-world rigid-body videos, highlighting the potential of language-driven physics inference for bridging perception and simulation.
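For readers unfamiliar with the metric, the sketch below shows one common way to compute segmentation IoU over a video and to use it for best-of-N selection among sampled candidates. It is an illustration under stated assumptions, not the paper's evaluation code: the function names (`mask_iou`, `video_iou`, `best_of_n`), the per-frame averaging, the empty-mask convention, and the use of IoU as the selection score are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def mask_iou(pred, target):
    """Intersection-over-union of two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return float(np.logical_and(pred, target).sum() / union)


def video_iou(pred_frames, target_frames):
    """Mean per-frame IoU over a video (averaging scheme is an assumption)."""
    return float(np.mean([mask_iou(p, t) for p, t in zip(pred_frames, target_frames)]))


def best_of_n(candidates, target_frames):
    """Return (index, score) of the candidate mask sequence with the highest IoU.

    Each candidate stands in for the masks rendered by simulating one sampled
    scene configuration; how candidates are produced is outside this sketch.
    """
    scores = [video_iou(c, target_frames) for c in candidates]
    best = int(np.argmax(scores))
    return best, scores[best]


# Tiny usage example with 2x2 masks and single-frame "videos".
full_row = [[1, 1], [0, 0]]
one_pixel = [[1, 0], [0, 0]]
index, score = best_of_n([[one_pixel], [full_row]], [full_row])
```

Here the second candidate matches the target exactly, so `best_of_n` selects it. An evolutionary search would repeat such a select step after mutating the surviving candidates, but the abstract gives no details on that procedure.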

Problem

Research questions and friction points this paper is trying to address.

rigid-body dynamics

monocular videos

physics-based perception

generalization

physical state inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

vision-language framework

language-based dynamics representation

structured text for physics simulation