FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

📅 2024-08-15
🏛️ arXiv.org
📈 Citations: 7
Influential: 0
🤖 AI Summary
Existing text-to-video (T2V) models struggle to generate motion-rich, temporally coherent videos, especially over extended durations, because spatial cross-attention applies the same text guidance to every frame and so cannot capture the temporal logic implied by a prompt. To address this, the paper proposes FancyVideo, a video generator built around the Cross-frame Textual Guidance Module (CTGM). CTGM places a Temporal Information Injector (TII), a Temporal Affinity Refiner (TAR), and a Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention, respectively, to achieve frame-specific text conditioning. TII injects frame-specific information from the latent features into the text conditions, TAR refines the correlation matrix between the cross-frame text conditions and the latent features along the time dimension, and TFB boosts the temporal consistency of the latent features. Quantitative and qualitative experiments demonstrate the effectiveness of the method. The code, model, and video demo are publicly released.

📝 Abstract
Synthesizing motion-rich and temporally consistent videos remains a challenge in artificial intelligence, especially when dealing with extended durations. Existing text-to-video (T2V) models commonly employ spatial cross-attention for text control, equivalently guiding different frame generations without frame-specific textual guidance. Thus, the model's capacity to comprehend the temporal logic conveyed in prompts and generate videos with coherent motion is restricted. To tackle this limitation, we introduce FancyVideo, an innovative video generator that improves the existing text-control mechanism with the well-designed Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner (TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of cross-attention, respectively, to achieve frame-specific textual guidance. Firstly, TII injects frame-specific information from latent features into text conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines the correlation matrix between cross-frame textual conditions and latent features along the time dimension. Lastly, TFB boosts the temporal consistency of latent features. Extensive experiments comprising both quantitative and qualitative evaluations demonstrate the effectiveness of FancyVideo. Our video demo, code and model are available at https://360cvgroup.github.io/FancyVideo/.
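
The three-stage flow described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not specify the internals of TII, TAR, or TFB, so each stage below is a deliberately simple, hypothetical stand-in (mean-pooled latent injection for TII, a 3-frame moving average over the correlation logits for TAR, and the same moving average over the output features for TFB). The function names, the `w_inject` projection, and all tensor shapes are assumptions chosen only to show where each component sits relative to cross-attention.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_smooth(x):
    """3-frame moving average along axis 0 (frames); edge frames are repeated."""
    padded = np.concatenate([x[:1], x, x[-1:]], axis=0)
    return (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0


def ctgm_cross_attention(latents, text, w_inject):
    """Cross-attention with hypothetical stand-ins for TII, TAR, and TFB.

    latents:  (F, N, D) array, F frames x N spatial tokens x D channels
    text:     (T, D) array, one prompt embedding shared by all frames
    w_inject: (D, D) array, assumed projection used by the TII stand-in
    Returns the attended features (F, N, D) and the per-frame text (F, T, D).
    """
    d = latents.shape[-1]

    # TII stand-in (beginning): add a projected per-frame latent summary to
    # the shared text condition, giving one text condition per frame.
    frame_summary = latents.mean(axis=1)                       # (F, D)
    frame_text = text[None] + (frame_summary @ w_inject)[:, None, :]

    # Correlation between each frame's latent tokens and its own text tokens.
    logits = np.einsum("fnd,ftd->fnt", latents, frame_text) / np.sqrt(d)

    # TAR stand-in (middle): refine the correlation along the time dimension.
    logits = temporal_smooth(logits)

    attn = softmax(logits, axis=-1)
    out = np.einsum("fnt,ftd->fnd", attn, frame_text)          # (F, N, D)

    # TFB stand-in (end): encourage temporal consistency of the features.
    out = temporal_smooth(out)
    return out, frame_text


rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 6, 8))   # 4 frames, 6 spatial tokens, 8 channels
text = rng.normal(size=(5, 8))         # 5 text tokens
w_inject = rng.normal(size=(8, 8)) * 0.1
out, frame_text = ctgm_cross_attention(latents, text, w_inject)
print(out.shape, frame_text.shape)     # (4, 6, 8) (4, 5, 8)
```

With `w_inject` set to zeros, every frame receives the same text condition, which corresponds to the plain spatial cross-attention setup that the abstract identifies as the limitation.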
Problem

Research questions and friction points this paper is trying to address.

Enhancing dynamic and consistent video generation from text
Improving temporal logic comprehension in video synthesis
Supporting both text-to-video and image-to-video generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-frame Textual Guidance Module for video generation
Temporal Information Injector injects frame-specific latent information into the text conditions
Temporal Affinity Refiner refines the text-latent correlation matrix along the time dimension