MechVerse: Evaluating Physical Motion Consistency in Video Generation Models

📅 2026-05-14
📈 Citations: 0
Influential: 0

🤖 AI Summary
Current image-to-video generation models struggle to accurately simulate mechanical motion governed by kinematic and geometric constraints, often exhibiting inconsistencies in rigidity preservation, component contact, and motion transmission. This work proposes MechVerse—the first benchmark dataset specifically designed for mechanical assembly scenarios—which systematically defines and quantifies mechanical motion consistency in video generation. The benchmark encompasses three levels of mechanism complexity and establishes a multi-tiered evaluation framework integrating synthetic data, structured prompts, standard video metrics, instruction-following scores, and human assessments of motion correctness. Experiments reveal that while state-of-the-art models maintain visual fidelity and temporal smoothness, they perform poorly in terms of mechanical plausibility, with error rates rising significantly as coupling complexity increases.
📝 Abstract
Text- and image-conditioned video generation models have achieved strong visual fidelity and temporal coherence, but they often fail to generate motion governed by kinematic and geometric constraints. In these settings, object parts must remain rigid, maintain contact or coupling with neighboring components, and transfer motion consistently across connected parts. These requirements are especially explicit in articulated mechanical assemblies, where motion is constrained by rigid-link geometry, contact/coupling relations, and transmission through kinematic chains. A generated video may therefore appear plausible while violating the intended mechanism, such as rotating a part that should translate, deforming a rigid component, breaking coupling between parts, or failing to move downstream components. To evaluate this gap, we introduce MechVerse, a benchmark for mechanically consistent image-to-video generation. MechVerse contains 21,156 synthetic clips from 1,357 mechanical assemblies across 141 categories, organized into three tiers of increasing kinematic complexity: independent articulation, pairwise coupling, and densely coupled multi-part mechanisms. Each clip is paired with a structured prompt describing part identities, stationary supports, moving components, motion primitives, direction, speed/extent, and inter-part dependencies. We evaluate proprietary, open-source, and fine-tuned image-to-video models using standard video metrics, instruction-following scores, and human judgments of motion correctness and kinematic coupling. Results show that current models can preserve appearance and smoothness while failing to generate mechanically admissible motion, with errors increasing as coupling complexity grows. MechVerse provides a benchmark for measuring and improving mechanism-aware video generation from image and language inputs.
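To make the structured prompt concrete, here is a minimal Python sketch of how one such record might be represented and rendered to text. The abstract lists the kinds of information a prompt carries but does not give a schema, so every class name, field name, and the example crank-slider mechanism below are assumptions for illustration only, not MechVerse's actual format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PartMotion:
    """One moving component (hypothetical field names)."""
    part: str        # part identity, e.g. "crank"
    primitive: str   # motion primitive, e.g. "rotate" or "translate"
    direction: str   # direction of motion, e.g. "clockwise"
    extent: str      # speed/extent, e.g. "one full turn"


@dataclass
class StructuredPrompt:
    """Hypothetical container for the fields named in the abstract."""
    tier: int                  # 1 = independent, 2 = pairwise, 3 = densely coupled
    stationary: List[str]      # stationary supports
    moving: List[PartMotion]   # moving components
    dependencies: List[str] = field(default_factory=list)  # inter-part dependencies

    def render(self) -> str:
        """Flatten the record into a single text prompt."""
        lines = [f"Stationary supports: {', '.join(self.stationary)}."]
        for m in self.moving:
            lines.append(
                f"The {m.part} should {m.primitive} {m.direction} ({m.extent})."
            )
        for d in self.dependencies:
            lines.append(f"Dependency: {d}.")
        return " ".join(lines)


# Illustrative tier-2 (pairwise coupling) example; the mechanism is made up.
prompt = StructuredPrompt(
    tier=2,
    stationary=["base frame"],
    moving=[
        PartMotion("crank", "rotate", "clockwise", "one full turn"),
        PartMotion("slider", "translate", "along the rail", "back and forth"),
    ],
    dependencies=["the slider is driven by the crank through the connecting rod"],
)
text = prompt.render()
print(text)
```

Keeping the dependencies as a separate field mirrors the abstract's distinction between what each part does on its own and how motion is transmitted between parts, which is the property the higher tiers are meant to stress.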
Problem

Research questions and friction points this paper is trying to address.

physical motion consistency
video generation
kinematic constraints
mechanical assemblies
motion correctness
Innovation

Methods, ideas, or system contributions that make the work stand out.

mechanical consistency
video generation
kinematic constraints
structured prompting
motion evaluation benchmark