Mode Seeking meets Mean Seeking for Fast Long Video Generation

📅 2026-02-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of long video generation, which is hindered by the scarcity of high-quality long-duration training data and the difficulty of simultaneously preserving fine-grained local details and long-term structural coherence. The authors propose a decoupled training paradigm that employs a global Flow Matching head to learn narrative structures from limited long videos, while a local Distribution Matching head aligns sliding-window segments with a frozen short-video teacher model using mode-seeking reverse KL divergence. This approach uniquely integrates mode-seeking and mean-seeking mechanisms within a Decoupled Diffusion Transformer framework, enabling separate optimization of local realism and global consistency. Experiments demonstrate that the method generates minute-long videos with high fidelity, natural motion, and strong temporal coherence in very few sampling steps, effectively overcoming the traditional trade-off between visual quality and sequence length.
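The global head's objective is standard conditional flow matching: regress a velocity field along a linear interpolation path between noise and data. Below is a minimal numpy sketch of that regression target, assuming the common linear (rectified) path; the toy latents and the two stand-in models are illustrative and not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model, x1, rng):
    """Conditional flow matching regression for the linear path
    x_t = (1 - t) * x0 + t * x1, whose target velocity is x1 - x0."""
    x0 = rng.standard_normal(x1.shape)      # noise endpoint
    t = rng.uniform(size=(x1.shape[0], 1))  # per-sample time in [0, 1]
    xt = (1.0 - t) * x0 + t * x1            # point on the interpolation path
    v_target = x1 - x0                      # constant velocity along the path
    v_pred = model(xt, t)
    return float(np.mean((v_pred - v_target) ** 2))

# Toy "long video" latents (illustrative stand-in for real training data).
data = rng.standard_normal((64, 8)) + 3.0

zero_model = lambda xt, t: np.zeros_like(xt)      # predicts no motion
mean_model = lambda xt, t: np.full_like(xt, 3.0)  # predicts the mean drift

loss_zero = flow_matching_loss(zero_model, data, rng)
loss_mean = flow_matching_loss(mean_model, data, rng)
```

In the paper's setting, `model` would be the global Flow Matching head of the Decoupled Diffusion Transformer, trained on long-video latents; the better predictor (here `mean_model`) attains the lower regression loss.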

📝 Abstract
Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence on top of a unified representation via a Decoupled Diffusion Transformer. Our approach uses a global Flow Matching head, trained via supervised learning on long videos, to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. The resulting minute-scale generator learns long-range coherence and motion from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to the frozen short-video teacher, yielding a fast, few-step long-video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion, and long-range consistency. Project website: https://primecai.github.io/mmm/.
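The local Distribution Matching head operates on overlapping sliding-window segments of the student's long sequence, each short enough for the frozen teacher to score. A minimal sketch of that windowing step; the window and stride values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def sliding_windows(latents, window, stride):
    """Slice a long latent sequence of shape [T, ...] into overlapping
    segments that a short-video teacher can score. Trailing frames that
    do not fill a full window are dropped in this simple sketch."""
    T = latents.shape[0]
    starts = range(0, max(T - window, 0) + 1, stride)
    return [latents[s:s + window] for s in starts]

video = np.zeros((97, 4))  # toy 97-frame latent sequence, 4-dim latents
segs = sliding_windows(video, window=16, stride=8)
```

Each segment would then be pushed toward a high-density region of the teacher's distribution via the reverse-KL distribution matching loss, while the global flow matching head handles cross-window structure.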
Problem

Research questions and friction points this paper is trying to address.

long video generation
coherence
fidelity-horizon gap
data scarcity
temporal consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoupled Diffusion Transformer
Mode Seeking
Flow Matching
Long Video Generation
Reverse-KL Divergence
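The "mode seeking" vs "mean seeking" distinction comes from the asymmetry of KL divergence: minimizing reverse KL(q||p) lets a limited-capacity q lock onto a single mode of p, while minimizing forward KL(p||q) forces q to cover all modes, averaging over them. This toy grid search makes the effect concrete by fitting one Gaussian to a bimodal target; it is a self-contained illustration, unrelated to the paper's models.

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 1601)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Bimodal target: two well-separated, equally weighted modes.
p = 0.5 * gaussian(x, -2.0, 0.5) + 0.5 * gaussian(x, 2.0, 0.5)
p = np.maximum(p, 1e-300)  # guard log(0)

best = {"fwd": (np.inf, None), "rev": (np.inf, None)}
for mu in np.linspace(-3.0, 3.0, 61):
    for sigma in np.linspace(0.3, 3.0, 28):
        q = np.maximum(gaussian(x, mu, sigma), 1e-300)
        fwd = np.sum(p * np.log(p / q)) * dx  # KL(p || q): mean seeking
        rev = np.sum(q * np.log(q / p)) * dx  # KL(q || p): mode seeking
        if fwd < best["fwd"][0]:
            best["fwd"] = (fwd, (mu, sigma))
        if rev < best["rev"][0]:
            best["rev"] = (rev, (mu, sigma))

mu_f, sig_f = best["fwd"][1]  # spans both modes: mu near 0, large sigma
mu_r, sig_r = best["rev"][1]  # locks onto one mode: |mu| near 2, small sigma
```

The forward-KL fit lands between the modes with a wide variance (a blurry average), while the reverse-KL fit commits to one sharp mode; the paper pairs the mode-seeking term (local realism from the teacher) with a mean-seeking supervised term (global structure from long videos).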