Time-to-Move: Training-Free Motion Controlled Video Generation via Dual-Clock Denoising

📅 2025-11-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion-based video generation methods struggle to achieve precise motion control and broad generalizability at once: image- or text-conditioned approaches lack fine-grained motion controllability, while motion-conditioned methods typically require costly model fine-tuning, which is computationally inefficient and generalizes poorly. This paper proposes a training-free, motion-controllable video generation framework that uses coarse reference animations—e.g., cut-and-drag or depth-reprojection sequences—as motion priors, combined with image-to-video (I2V) diffusion models and region-wise mask guidance. Its core innovations are a dual-clock denoising mechanism and a region-dependent alignment strategy, which decouple motion and appearance at the pixel level within motion-specified regions while preserving generative freedom elsewhere. Evaluated on object- and camera-motion benchmarks, the method matches or surpasses fine-tuned baselines in visual quality while offering plug-and-play deployment, zero-shot adaptation, zero training overhead, and full compatibility with mainstream I2V models.

📝 Abstract
Diffusion-based video generation can create realistic videos, yet existing image- and text-based conditioning fails to offer precise motion control. Prior methods for motion-conditioned synthesis typically require model-specific fine-tuning, which is computationally expensive and restrictive. We introduce Time-to-Move (TTM), a training-free, plug-and-play framework for motion- and appearance-controlled video generation with image-to-video (I2V) diffusion models. Our key insight is to use crude reference animations obtained through user-friendly manipulations such as cut-and-drag or depth-based reprojection. Motivated by SDEdit's use of coarse layout cues for image editing, we treat the crude animations as coarse motion cues and adapt the mechanism to the video domain. We preserve appearance with image conditioning and introduce dual-clock denoising, a region-dependent strategy that enforces strong alignment in motion-specified regions while allowing flexibility elsewhere, balancing fidelity to user intent with natural dynamics. This lightweight modification of the sampling process incurs no additional training or runtime cost and is compatible with any backbone. Extensive experiments on object and camera motion benchmarks show that TTM matches or exceeds existing training-based baselines in realism and motion control. Beyond this, TTM introduces a unique capability: precise appearance control through pixel-level conditioning, exceeding the limits of text-only prompting. Visit our project page for video examples and code: https://time-to-move.github.io/.
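The dual-clock idea from the abstract—re-noising the crude reference animation on two different schedules, a late (low-noise) clock inside the motion mask for strong alignment and an early (high-noise) clock elsewhere for generative freedom—can be illustrated with a toy sketch. This is not the paper's implementation: the function name `dual_clock_denoise`, the linear noise schedule, and the stand-in denoiser are all assumptions made for illustration.

```python
import numpy as np

def dual_clock_denoise(reference, mask, denoise_step, n_steps=50,
                       t_motion=10, t_free=40, seed=0):
    """Toy dual-clock sampler (illustrative only, not TTM's actual code).

    reference    : (H, W) frame from the crude reference animation
    mask         : (H, W) bool, True inside the motion-specified region
    denoise_step : callable (x, t) -> slightly cleaner x; stands in for
                   one reverse step of an I2V diffusion model
    t_motion     : late clock; the masked region tracks the noised
                   reference until this step, enforcing strong alignment
    t_free       : early clock; the rest is released much sooner,
                   leaving it largely free to be regenerated
    """
    rng = np.random.default_rng(seed)
    # toy linear alpha_bar schedule: ~1 at t=0 (clean), ~0 at t=n_steps
    alpha_bar = np.linspace(1.0, 0.02, n_steps + 1)
    x = rng.standard_normal(reference.shape)  # start from pure noise
    for t in range(n_steps, 0, -1):
        # SDEdit-style replacement: while a region's clock has not yet
        # started, overwrite it with the reference noised to level t
        noised_ref = (np.sqrt(alpha_bar[t]) * reference
                      + np.sqrt(1.0 - alpha_bar[t])
                      * rng.standard_normal(reference.shape))
        if t >= t_motion:
            x = np.where(mask, noised_ref, x)   # motion region: late clock
        if t >= t_free:
            x = np.where(~mask, noised_ref, x)  # elsewhere: early clock
        x = denoise_step(x, t)
    return x
```

With a trivial shrinking denoiser, the masked half of the output stays much closer to the reference than the unmasked half, which is exactly the fidelity/flexibility trade-off the two clocks are meant to produce.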
Problem

Research questions and friction points this paper is trying to address.

Existing video generation lacks precise motion control without expensive fine-tuning
Current methods fail to balance motion alignment with natural video dynamics
Text-based conditioning cannot provide precise appearance control in generated videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free plug-and-play motion control framework
Dual-clock denoising for region-dependent motion alignment
Crude animation integration without model fine-tuning
Assaf Singer
Technion – Israel Institute of Technology
Noam Rotstein
Technion, Lumana.ai
AI · Machine Learning · Computer Vision
Amir Mann
Technion – Israel Institute of Technology
Ron Kimmel
Professor of CS (& ECE), Technion; CSO of Lumana, Israel
Image processing · computer vision · shape analysis · medical imaging · metric geometry
O. Litany
Technion – Israel Institute of Technology, NVIDIA