🤖 AI Summary
Existing methods for inferring intrinsic object dynamics from visual observations either rely on hand-crafted priors, which limits generalizability, or employ black-box neural networks, which sacrifices interpretability. To address this, we propose a bilevel optimization framework: the upper level leverages a large language model to generate and evolve constitutive laws, augmented with a decoupling mechanism that reduces search-space complexity; the lower level performs differentiable, vision-guided physical simulation that provides gradient-based feedback for refining the candidate expressions. Our approach discovers interpretable, generalizable, explicit dynamical laws directly from video in an end-to-end manner. Extensive experiments demonstrate significant improvements over state-of-the-art methods on both synthetic and real-world datasets. Moreover, our model exhibits strong cross-scenario generalization and physically plausible behavior in interactive simulations.
📝 Abstract
The intrinsic dynamics of an object governs its physical behavior in the real world, playing a critical role in enabling physically plausible interactive simulation with 3D assets. Existing methods have attempted to infer the intrinsic dynamics of objects from visual observations, but generally face two major challenges: one line of work relies on manually defined constitutive priors, making it difficult to generalize to complex scenarios; the other models intrinsic dynamics with neural networks, resulting in limited interpretability and poor generalization. To address these challenges, we propose VisionLaw, a bilevel optimization framework that infers interpretable expressions of intrinsic dynamics from visual observations. At the upper level, we introduce an LLM-driven decoupled constitutive evolution strategy, where an LLM is prompted to act as a knowledgeable physics expert that generates and revises constitutive laws, with a built-in decoupling mechanism that substantially reduces the complexity of the LLM's search space. At the lower level, we introduce a vision-guided constitutive evaluation mechanism, which uses visual simulation to evaluate the consistency between a generated constitutive law and the underlying intrinsic dynamics, thereby guiding the upper-level evolution. Experiments on both synthetic and real-world datasets demonstrate that VisionLaw can effectively infer interpretable intrinsic dynamics from visual observations. It significantly outperforms existing state-of-the-art methods and exhibits strong generalization for interactive simulation in novel scenarios.
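The bilevel loop described in the abstract can be illustrated in miniature. The sketch below is a hypothetical stand-in, not the paper's implementation: the hard-coded candidate laws play the role of LLM-generated constitutive expressions (upper level), and a toy 1-D particle simulator with a finite-difference parameter fit plays the role of the differentiable, vision-guided evaluation (lower level). All names (`simulate`, `fit`, the candidate forms) are illustrative assumptions.

```python
# Toy sketch of a bilevel constitutive search. Upper level: propose symbolic
# constitutive laws (stand-ins for LLM-generated expressions). Lower level:
# fit each law's parameter against an "observed" trajectory and score the fit.
# The 1-D spring system and finite-difference gradients are illustrative only.

def simulate(law, k, steps=50, dt=0.01):
    """Roll out a 1-D particle under restoring force -law(strain, k)."""
    x, v, traj = 1.0, 0.0, []
    for _ in range(steps):
        v += -law(x, k) * dt   # force from the candidate constitutive law
        x += v * dt
        traj.append(x)
    return traj

def loss(law, k, observed):
    """Trajectory mismatch, standing in for the vision-guided loss."""
    return sum((a - b) ** 2 for a, b in zip(simulate(law, k), observed))

def fit(law, observed, k=1.0, lr=0.5, iters=200, eps=1e-4):
    """Lower level: refine the law's parameter via finite-difference gradients."""
    for _ in range(iters):
        g = (loss(law, k + eps, observed) - loss(law, k - eps, observed)) / (2 * eps)
        k -= lr * g
    return k, loss(law, k, observed)

# Upper level: candidate constitutive forms (LLM-proposal stand-ins).
candidates = {
    "linear": lambda s, k: k * s,
    "cubic":  lambda s, k: k * s ** 3,
}

# Synthetic "observation" generated from a linear law with k = 4.0.
observed = simulate(candidates["linear"], 4.0)

# Fit every candidate; the lowest loss selects the best-matching law.
results = {name: fit(law, observed) for name, law in candidates.items()}
best = min(results, key=lambda n: results[n][1])
```

In this toy setting the linear candidate recovers the ground-truth stiffness and beats the cubic form; in the paper's framework, the analogous lower-level feedback is what steers the LLM's revision of candidate expressions.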