RIFLEx: A Free Lunch for Length Extrapolation in Video Diffusion Transformers

📅 2025-02-21
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
In video generation, long-sequence diffusion models often suffer from motion repetition and deceleration due to frequency mismatch in positional encoding. This work first identifies that the dominant intrinsic frequency of positional encoding governs temporal extrapolation behavior. Building on this insight, we propose a training-free inference-time frequency scaling methodโ€”reducing this dominant frequency enables high-fidelity sequence-length extrapolation without retraining. We further introduce a frequency-domain-driven positional encoding modulation scheme coupled with lightweight fine-tuning, enhancing dynamic fidelity while preserving motion consistency and diversity. Evaluated on state-of-the-art video diffusion models, our approach achieves 2ร— zero-shot temporal extrapolation and, with minimal fine-tuning, supports up to 3ร— extrapolation. It effectively suppresses temporal artifacts (e.g., looping and slowdown) and retains high-fidelity motion details. The method establishes an efficient, general-purpose paradigm for temporal extension in long-video generation.

๐Ÿ“ Abstract
Recent advancements in video generation have enabled models to synthesize high-quality, minute-long videos. However, generating even longer videos with temporal coherence remains a major challenge, and existing length extrapolation methods lead to temporal repetition or motion deceleration. In this work, we systematically analyze the role of frequency components in positional embeddings and identify an intrinsic frequency that primarily governs extrapolation behavior. Based on this insight, we propose RIFLEx, a minimal yet effective approach that reduces the intrinsic frequency to suppress repetition while preserving motion consistency, without requiring any additional modifications. RIFLEx offers a true free lunch--achieving high-quality 2× extrapolation on state-of-the-art video diffusion transformers in a completely training-free manner. Moreover, it enhances quality and enables 3× extrapolation by minimal fine-tuning without long videos. Project page and code: https://riflex-video.github.io/
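The core idea, reducing the intrinsic frequency of the positional embedding so that its period spans the extended sequence, can be sketched as follows. This is an illustrative sketch assuming standard RoPE-style frequencies; the function names and the rule used here to pick the intrinsic component (the lowest frequency whose full period still fits within the training length) are assumptions for illustration, not the authors' released code.

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    # Standard RoPE frequency ladder: one frequency per pair of channels.
    return base ** (-np.arange(0, dim, 2) / dim)

def riflex_scale(freqs, train_len, target_len):
    """Reduce the 'intrinsic' frequency so its period covers target_len.

    Assumption for this sketch: the intrinsic component is the lowest
    frequency whose full period fits within the training length (the
    paper identifies the governing component empirically).
    """
    periods = 2 * np.pi / freqs
    fits = periods <= train_len          # components whose period fits in training
    if not fits.any():
        return freqs.copy()
    k = np.argmax(periods * fits)        # longest fitting period = lowest such frequency
    scaled = freqs.copy()
    # Shrink that one frequency so a single period now spans the target length,
    # suppressing repetition during length extrapolation.
    scaled[k] = freqs[k] * train_len / target_len
    return scaled
```

For 2× extrapolation (e.g. `train_len=128`, `target_len=256`), exactly one frequency component is halved and all others are left untouched, which matches the "minimal modification" framing of the abstract.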
Problem

Research questions and friction points this paper is trying to address.

Addresses video generation length limitations
Reduces repetition in extended videos
Enhances motion consistency without training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reduces intrinsic frequency
Preserves motion consistency
Training-free length extrapolation
Min Zhao
Dept. of Comp. Sci. & Tech., BNRist Center, THU-Bosch ML Center, Tsinghua University; ShengShu
Guande He
Ph.D. Student, University of Texas at Austin
Machine Learning · Foundation Model · Deep Generative Models
Yixiao Chen
Dept. of Comp. Sci. & Tech., BNRist Center, THU-Bosch ML Center, Tsinghua University; ShengShu
Hongzhou Zhu
Tsinghua University
Generative Models
Chongxuan Li
Associate Professor, Renmin University of China
Machine Learning · Generative Models · Deep Learning
Jun Zhu
Dept. of Comp. Sci. & Tech., BNRist Center, THU-Bosch ML Center, Tsinghua University; ShengShu; Pazhou Laboratory (Huangpu)