MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

📅 2026-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing motion control methods for video generation often produce unnatural or implausible motions because they rigidly follow user-provided trajectories that are sparse, imprecise, and causally incomplete. This work proposes a two-stage reasoning-then-generation framework. First, a training-free vision-language reasoner refines the image-space coordinates of the primary trajectories and hallucinates plausible secondary motions. Second, a confidence-aware control scheme modulates guidance strength, so the generator follows high-confidence plans closely and relies on its internal generative priors to correct artifacts when inputs are low-confidence. The stated contributions are the use of visual reasoning to fill causal gaps in motion control, the confidence-aware guidance strategy, and MotiBench, a new interaction-centric image-to-video benchmark. On MotiBench, both VLM-based evaluation and a human study find that the method produces more plausible object behaviors and interactions, and that it is preferred over existing approaches.

📝 Abstract

Current motion-controlled image-to-video generation models rigidly follow user-provided trajectories that are often sparse, imprecise, and causally incomplete. Such reliance often yields unnatural or implausible outcomes, especially by missing secondary causal consequences. To address this, we introduce MotiMotion, a novel framework that reformulates motion control as a reasoning-then-generation problem. To encourage causally grounded and commonsense-consistent interactions, we leverage a training-free vision-language reasoner to refine image-space coordinates of primary trajectories and to hallucinate plausible secondary motions. To further improve motion naturalness, we propose a confidence-aware control scheme that modulates guidance strength, enabling the model to closely follow high-confidence plans while correcting artifacts under low-confidence inputs with its internal generative priors. To support systematic evaluation, we curate a new image-to-video benchmark, MotiBench, consisting of interaction-centric scenes where new events are triggered by motion. Both VLM-based evaluation and a human study on MotiBench demonstrate that MotiMotion produces videos with more plausible object behaviors and interaction, and is preferred over existing approaches.
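
The confidence-aware control idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how confidence is computed or how guidance is applied inside the generator. The sketch assumes the reasoner returns a confidence in [0, 1] per trajectory, maps it to a guidance strength by linear interpolation, and blends planned positions with a stand-in for the generator's prior. The names `PlannedTrajectory`, `guidance_strength`, and `blend_positions`, and the bounds `lo` and `hi`, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class PlannedTrajectory:
    """One trajectory proposed by the reasoner (hypothetical structure)."""
    points: List[Point]  # image-space (x, y) coordinates, one per frame
    confidence: float    # assumed confidence score in [0, 1]


def guidance_strength(confidence: float, lo: float = 0.2, hi: float = 1.0) -> float:
    """Map a confidence score to a control strength.

    Linear interpolation between lo and hi is an assumption made for
    illustration; the source only says that strength is modulated.
    """
    c = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    return lo + (hi - lo) * c


def blend_positions(prior: List[Point], planned: PlannedTrajectory) -> List[Point]:
    """Blend prior positions with the planned trajectory, frame by frame.

    `prior` is a toy stand-in for what the generator would produce on its
    own. High confidence pulls the result toward the plan; low confidence
    leaves more room for the prior.
    """
    w = guidance_strength(planned.confidence)
    return [
        ((1 - w) * px + w * qx, (1 - w) * py + w * qy)
        for (px, py), (qx, qy) in zip(prior, planned.points)
    ]
```

In this sketch, a trajectory with confidence 1.0 is followed exactly, while one with confidence 0.0 still receives a weak pull of strength `lo` toward the plan. A real system would apply the strength to the generator's conditioning signal instead of to positions directly.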

Problem

Research questions and friction points this paper is trying to address.

motion-controlled video generation

causal reasoning

visual commonsense

trajectory refinement

secondary motion

Innovation

Methods, ideas, or system contributions that make the work stand out.

motion-controlled video generation

visual reasoning

causal consistency