🤖 AI Summary
To address weak motion controllability in text-to-video generation, this paper proposes a training-free directional motion control framework. Methodologically, it introduces a plug-and-play Directional Motion Control module for precise trajectory guidance and, for the first time in diffusion-based video generation, incorporates a RAFT-based Motion Intensity Modulator to jointly decouple and regulate motion direction and intensity. The approach integrates diffusion modeling, cross-attention mechanisms, text-video conditional generation, and explicit motion feature injection. Experiments demonstrate state-of-the-art motion controllability across multiple benchmarks: trajectory alignment error is reduced by 37%, motion intensity supports five-level fine-grained adjustment, and inference speed is accelerated by 8.2× compared to fine-tuning-based alternatives.
📝 Abstract
Recent advancements in diffusion models have shown great promise in producing high-quality video content. However, efficiently training video diffusion models capable of integrating directional guidance and controllable motion intensity remains a challenging and under-explored area. To tackle these challenges, this paper introduces Mojito, a diffusion model that incorporates both motion trajectory and intensity control for text-to-video generation. Specifically, Mojito features a Directional Motion Control (DMC) module that leverages cross-attention to efficiently direct the generated object's motion without training, alongside a Motion Intensity Modulator (MIM) that uses optical flow maps generated from videos to guide varying levels of motion intensity. Extensive experiments demonstrate Mojito's effectiveness in achieving precise trajectory and intensity control with high computational efficiency, generating motion patterns that closely match specified directions and intensities, providing realistic dynamics that align well with natural motion in real-world scenarios.