UltraViCo: Breaking Extrapolation Limits in Video Diffusion Transformers

📅 2025-11-25
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Video diffusion Transformers exhibit poor out-of-distribution generalization beyond their training sequence length, manifesting as periodic content repetition and degraded visual fidelity. This work identifies “attention dispersion”—the unintended activation of distant tokens outside the training context window—as the root cause, revealed for the first time through systematic analysis of attention maps. We propose a training-free, plug-and-play attention suppression method: during inference, we apply a constant exponential decay factor to explicitly attenuate attention scores for tokens beyond the trained context window. The method is architecture-agnostic and requires no fine-tuning. Experiments demonstrate that, under 4× length extrapolation, our approach improves motion dynamics by 233% and video quality by 40.5% over prior methods. Moreover, it seamlessly transfers to downstream tasks—including controllable video generation and editing—establishing a universal, efficient, and lightweight paradigm for long-video synthesis.

📝 Abstract
Despite advances, video diffusion transformers still struggle to generalize beyond their training length, a challenge we term video length extrapolation. We identify two failure modes: model-specific periodic content repetition and a universal quality degradation. Prior works attempt to solve repetition via positional encodings, overlooking quality degradation and achieving only limited extrapolation. In this paper, we revisit this challenge from a more fundamental view: attention maps, which directly govern how context influences outputs. We identify that both failure modes arise from a unified cause: attention dispersion, where tokens beyond the training window dilute learned attention patterns. This dilution directly causes quality degradation; repetition emerges as a special case when the dispersion becomes structured into periodic attention patterns, induced by harmonic properties of positional encodings. Building on this insight, we propose UltraViCo, a training-free, plug-and-play method that suppresses attention for tokens beyond the training window via a constant decay factor. By jointly addressing both failure modes, we outperform a broad set of baselines by large margins across models and extrapolation ratios, pushing the extrapolation limit from 2x to 4x. Remarkably, it improves Dynamic Degree and Imaging Quality by 233% and 40.5% over the previous best method at 4x extrapolation. Furthermore, our method generalizes seamlessly to downstream tasks such as controllable video synthesis and editing.
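The core mechanism described above, attenuating attention for tokens outside the training window by a constant decay factor, can be sketched in a few lines. This is an illustrative reconstruction under stated assumptions, not the paper's implementation: the function names, the choice to apply the decay to post-softmax weights followed by renormalization, and the example `decay` value are all assumptions made here for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def windowed_attention(q, k, train_window, decay=0.5):
    """Sketch of out-of-window attention suppression (assumed formulation).

    q, k:         (L, d) query/key matrices for a single attention head.
    train_window: context length W seen during training.
    decay:        constant suppression factor for tokens beyond W
                  (hypothetical value; the paper's choice may differ).
    """
    L, d = q.shape
    weights = softmax(q @ k.T / np.sqrt(d))                       # (L, L) attention
    dist = np.abs(np.arange(L)[:, None] - np.arange(L)[None, :])  # |i - j|
    mask = np.where(dist >= train_window, decay, 1.0)             # attenuate far tokens
    weights = weights * mask
    return weights / weights.sum(axis=-1, keepdims=True)          # renormalize rows
```

Because the mask only rescales existing weights, the sketch stays training-free and architecture-agnostic in the spirit of the abstract: it can wrap any attention call at inference time without touching model parameters.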
Problem

Research questions and friction points this paper is trying to address.

Video diffusion transformers fail to generalize beyond training length
Attention dispersion causes quality degradation and content repetition
Proposed method suppresses attention for out-of-window tokens
Innovation

Methods, ideas, or system contributions that make the work stand out.

Addresses attention dispersion in video transformers
Uses constant decay factor for training-free extrapolation
Enables seamless generalization to downstream video tasks
Min Zhao
Dept. of Comp. Sci. & Tech., BNRist Center, THU-Bosch ML Center, Tsinghua University
Hongzhou Zhu
Tsinghua University
Yingze Wang
Dept. of Comp. Sci. & Tech., BNRist Center, THU-Bosch ML Center, Tsinghua University
Bokai Yan
Gaoling School of Artificial Intelligence, Renmin University of China
Jintao Zhang
Tsinghua University
Guande He
Ph.D. Student, University of Texas at Austin
Ling Yang
Postdoc, Princeton University; Ph.D., Peking University
Chongxuan Li
Associate Professor, Renmin University of China
Jun Zhu
Dept. of Comp. Sci. & Tech., BNRist Center, THU-Bosch ML Center, Tsinghua University