MultiCOIN: Multi-Modal COntrollable Video INbetweening

📅 2025-10-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing video frame interpolation methods suffer from inadequate modeling of complex motion, insufficient control over intermediate-frame details, and poor alignment with user intent. To address these limitations, we propose a multimodal controllable video frame interpolation framework. Our approach employs a dual-branch architecture to disentangle content and motion representations, introduces a unified sparse point-based representation compatible with diverse control signals—including depth maps, hierarchical masks, motion trajectories, text prompts, and target region localization—and adopts a Diffusion Transformer (DiT) backbone with dual-encoder-guided denoising, trained via a staged optimization strategy. Experiments demonstrate that our method significantly outperforms state-of-the-art approaches in dynamic fidelity, fine-grained controllability, and temporal-contextual consistency, enabling high-quality, long-duration, and semantically faithful video generation.
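The dual-encoder-guided denoising described above can be pictured with a short sketch. The PyTorch-style code below is a minimal illustration, not the authors' implementation; the module names, feature dimensions, and the token-concatenation injection scheme are all assumptions.

```python
import torch
import torch.nn as nn

class BranchEncoder(nn.Module):
    """Encodes one control stream (motion or content) into guidance tokens."""
    def __init__(self, in_dim: int, dim: int, num_tokens: int = 16):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        self.queries = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, control: torch.Tensor) -> torch.Tensor:
        # control: (B, N, in_dim) control features for this branch
        ctx = self.proj(control)
        q = self.queries.unsqueeze(0).expand(control.shape[0], -1, -1)
        out, _ = self.attn(q, ctx, ctx)  # (B, num_tokens, dim)
        return out

class DualGuidedDenoiser(nn.Module):
    """Toy DiT-style denoiser conditioned on separate motion/content tokens."""
    def __init__(self, dim: int = 256, depth: int = 4):
        super().__init__()
        self.motion_enc = BranchEncoder(in_dim=4, dim=dim)     # e.g. (x, y, t, id) points
        self.content_enc = BranchEncoder(in_dim=768, dim=dim)  # e.g. text embeddings
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, dim)

    def forward(self, noisy_latents, motion_ctrl, content_ctrl):
        # noisy_latents: (B, L, dim) patchified video latent tokens
        guide = torch.cat([self.motion_enc(motion_ctrl),
                           self.content_enc(content_ctrl)], dim=1)
        x = torch.cat([noisy_latents, guide], dim=1)  # append guidance tokens
        x = self.blocks(x)
        return self.head(x[:, :noisy_latents.shape[1]])  # denoise video tokens only
```

The key idea the sketch tries to capture is the separation: motion and content controls never share an encoder, so each branch can specialize before both guide the same denoising trunk.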

📝 Abstract
Video inbetweening creates smooth and natural transitions between two image frames, making it an indispensable tool for video editing and long-form video synthesis. Existing works in this domain are unable to generate large, complex, or intricate motions. In particular, they cannot accommodate the versatility of user intents and generally lack fine control over the details of intermediate frames, leading to misalignment with the creative mind. To fill these gaps, we introduce MultiCOIN, a video inbetweening framework that allows multi-modal controls, including depth transition and layering, motion trajectories, text prompts, and target regions for movement localization, while achieving a balance between flexibility, ease of use, and precision for fine-grained video interpolation. To achieve this, we adopt the Diffusion Transformer (DiT) architecture as our video generative model, due to its proven capability to generate high-quality long videos. To ensure compatibility between DiT and our multi-modal controls, we map all motion controls into a common sparse and user-friendly point-based representation as the video/noise input. Further, to respect the variety of controls which operate at varying levels of granularity and influence, we separate content controls and motion controls into two branches to encode the required features before guiding the denoising process, resulting in two generators, one for motion and the other for content. Finally, we propose a stage-wise training strategy to ensure that our model learns the multi-modal controls smoothly. Extensive qualitative and quantitative experiments demonstrate that multi-modal controls enable a more dynamic, customizable, and contextually accurate visual narrative.
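As a concrete (hypothetical) picture of the "common sparse point-based representation" the abstract mentions, the sketch below rasterizes trajectory points, sparse depth anchors, and a target region onto separate channels of one control tensor. The channel layout, sampling stride, and function names are illustrative assumptions, not the paper's actual encoding.

```python
import torch

def rasterize_points(points, T, H, W, C):
    """points: list of (t, y, x, feature[C]) tuples -> (C, T, H, W) sparse grid."""
    grid = torch.zeros(C, T, H, W)
    for t, y, x, feat in points:
        grid[:, t, y, x] = feat
    return grid

def trajectory_to_points(traj):
    # traj: [(t, y, x), ...] user-drawn motion path -> unit markers on channel 0
    return [(t, y, x, torch.tensor([1.0, 0.0, 0.0])) for t, y, x in traj]

def depth_anchors_to_points(anchors):
    # anchors: [(t, y, x, depth), ...] sparse depth hints -> depth on channel 1
    return [(t, y, x, torch.tensor([0.0, d, 0.0])) for t, y, x, d in anchors]

def region_to_points(t, box, stride=8):
    # box: (y0, x0, y1, x1) target region -> coverage markers on channel 2
    y0, x0, y1, x1 = box
    return [(t, y, x, torch.tensor([0.0, 0.0, 1.0]))
            for y in range(y0, y1, stride)
            for x in range(x0, x1, stride)]

# Usage: fuse all controls into one (C, T, H, W) tensor that can be fed
# alongside the noisy video latents.
T, H, W, C = 16, 64, 64, 3
points = (trajectory_to_points([(0, 10, 10), (8, 20, 30), (15, 32, 50)])
          + depth_anchors_to_points([(0, 12, 12, 0.4), (15, 30, 48, 0.9)])
          + region_to_points(t=15, box=(24, 40, 40, 60)))
control = rasterize_points(points, T, H, W, C)
```

The appeal of such a unification is that every control, however different at the UI level, reaches the generator in the same sparse spatio-temporal format, so the DiT input interface never changes as controls are added.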
Problem

Research questions and friction points this paper is trying to address.

Existing inbetweening methods fail on large, complex, or intricate motions
Users lack fine-grained, multi-modal control over intermediate-frame details
Generated transitions often misalign with user intent
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Diffusion Transformer for video generation
Maps controls to sparse point-based representation
Separates motion and content into dual branches
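One further contribution stated in the abstract is a stage-wise training strategy so the model learns the multi-modal controls smoothly. Below is a hedged sketch of what such a schedule could look like; the stage names, control groupings, step counts, and the `denoising_loss` interface are all hypothetical.

```python
# Hypothetical stage-wise schedule: train base inbetweening first, then
# progressively enable motion and content controls. Not the paper's code.
STAGES = [
    {"name": "base",    "controls": []},                          # endpoint frames only
    {"name": "motion",  "controls": ["trajectory", "depth"]},     # motion branch on
    {"name": "content", "controls": ["text", "mask", "region"]},  # content branch on
]

def train_stagewise(model, loaders, optimizer, steps_per_stage=10_000):
    for stage in STAGES:
        for step, batch in zip(range(steps_per_stage), loaders[stage["name"]]):
            # `denoising_loss` is an assumed interface: a standard diffusion
            # objective with only the listed controls active in this stage.
            loss = model.denoising_loss(batch, active_controls=stage["controls"])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```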
Authors

Maham Tanveer
Simon Fraser University
Computer Vision and Graphics
Yang Zhou
Adobe Research
Simon Niklaus
Staff Research Scientist at Google DeepMind
Ali Mahdavi Amiri
Simon Fraser University
Hao Zhang
Simon Fraser University
Krishna Kumar Singh
Adobe Research
Computer Vision, Machine Learning
Nanxuan Zhao
Adobe Research