Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion

📅 2025-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address poor long-range temporal consistency and weak inter-frame coherence in tuning-free long-video generation, this paper proposes Ouroboros-Diffusion, a diffusion framework built on a first-in-first-out (FIFO) frame queue. Methodologically: (1) it introduces an improved latent sampling scheme at the queue tail to strengthen structural consistency across frames; (2) it designs Subject-Aware Cross-Frame Attention (SACFA) to explicitly align subjects across frames; and (3) it adds self-recurrent guidance, in which cleaner frames near the queue head provide global contextual feedback for denoising the noisier frames near the tail. The framework requires no model fine-tuning, only a pre-trained text-to-video diffusion model and a FIFO noise queue, and supports video generation of arbitrary length. Evaluated on the VBench benchmark, it achieves clear gains in subject consistency, motion smoothness, and temporal consistency, establishing an efficient, general-purpose paradigm for consistency modeling in long-video generation.
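The queue mechanics behind this summary can be sketched in a few lines. The sketch below is illustrative only, not the authors' code: `denoise_step` and `decode` are hypothetical callables standing in for one denoising step of a pre-trained text-to-video diffusion model and its latent decoder.

```python
# Minimal sketch of the FIFO-queue denoising loop (illustrative, not the paper's code).
from collections import deque

import torch


def fifo_long_video(denoise_step, decode, num_frames_out,
                    queue_len=16, latent_shape=(4, 64, 64)):
    # Each queue slot has a fixed noise level: low (nearly clean) at the head,
    # high (pure Gaussian noise) at the tail.
    timesteps = torch.linspace(1, 999, queue_len).long()
    queue = deque(torch.randn(latent_shape) for _ in range(queue_len))

    frames = []
    for _ in range(num_frames_out):
        latents = torch.stack(list(queue))            # (queue_len, C, H, W)
        latents = denoise_step(latents, timesteps)    # one diagonal denoising step

        frames.append(decode(latents[0]))             # head frame is now clean: dequeue it

        # Shift everything one slot toward the head and enqueue fresh noise at the tail.
        # Ouroboros-Diffusion replaces this plain Gaussian enqueue with a tail latent
        # sampled to stay structurally coherent with the preceding frames.
        queue = deque(latents[1:])
        queue.append(torch.randn(latent_shape))
    return frames
```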

📝 Abstract
The first-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. This technique maintains a queue of video frames with progressively increasing noise, continuously producing clean frames at the queue's head while Gaussian noise is enqueued at the tail. However, FIFO-Diffusion often struggles to keep long-range temporal consistency in the generated videos due to the lack of correspondence modeling across frames. In this paper, we propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency, enabling the generation of consistent videos of arbitrary length. Specifically, we introduce a new latent sampling technique at the queue tail to improve structural consistency, ensuring perceptually smooth transitions among frames. To enhance subject consistency, we devise a Subject-Aware Cross-Frame Attention (SACFA) mechanism, which aligns subjects across frames within short segments to achieve better visual coherence. Furthermore, we introduce self-recurrent guidance. This technique leverages information from all previous cleaner frames at the front of the queue to guide the denoising of noisier frames at the end, fostering rich and contextual global information interaction. Extensive experiments of long video generation on the VBench benchmark demonstrate the superiority of our Ouroboros-Diffusion, particularly in terms of subject consistency, motion smoothness, and temporal consistency.
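As an illustration of the SACFA idea described in the abstract, the minimal sketch below lets each frame's queries attend to its own tokens plus subject-region tokens gathered from all frames in a short segment. The function and the `subject_mask` input (a per-frame foreground mask from an external segmenter) are hypothetical, not the paper's implementation.

```python
# Minimal sketch of subject-aware cross-frame attention (illustrative only).
import torch


def sacfa_attention(q, k, v, subject_mask):
    """q, k, v: (frames, tokens, dim); subject_mask: (frames, tokens) bool."""
    f, n, d = q.shape
    # Gather subject tokens from every frame and share them across the segment.
    subj_k = k[subject_mask].unsqueeze(0).expand(f, -1, -1)
    subj_v = v[subject_mask].unsqueeze(0).expand(f, -1, -1)

    k_aug = torch.cat([k, subj_k], dim=1)             # (f, n + n_subject, d)
    v_aug = torch.cat([v, subj_v], dim=1)

    attn = torch.softmax(q @ k_aug.transpose(1, 2) / d ** 0.5, dim=-1)
    return attn @ v_aug                               # (f, n, d), subject-aligned output
```

For example, with q, k, v of shape (8, 1024, 64) for an 8-frame segment and a boolean foreground mask of shape (8, 1024), every frame's attention output is computed against the same shared pool of subject keys, which is what pulls the subject's appearance toward a common representation.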
Problem

Research questions and friction points this paper is trying to address.

Video Generation
Coherence
FIFO Method
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ouroboros-Diffusion
Subject-Aware Cross-Frame Attention (SACFA)
Self-Recurrent Guidance
Jingyuan Chen
University of Rochester, Rochester, NY USA
Fuchen Long
University of Science and Technology of China
Video Analysis
Jie An
University of Rochester, Rochester, NY USA
Zhaofan Qiu
AI Research, JD.COM
Deep Learning, Computer Vision, Multimedia
Ting Yao
HiDream.ai Inc.
Jiebo Luo
University of Rochester, Rochester, NY USA
Tao Mei
HiDream.ai; Fellow of CAE/IEEE/IAPR/CAAI
Multimedia Analysis, Computer Vision, Generative AI, Artificial Intelligence