Long Video Diffusion Generation with Segmented Cross-Attention and Content-Rich Video Data Curation

📅 2024-12-02

🏛️ arXiv.org

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

To address the twin challenges of diminishing scene diversity and deteriorating semantic coherence in long video generation, this paper introduces Presto, a video diffusion model that generates 15-second videos with long-range coherence and rich content. Methodologically, Presto proposes a parameter-free Segmented Cross-Attention (SCA) mechanism, which splits the hidden states into segments along the temporal dimension and lets each segment cross-attend to its own sub-caption; builds on the DiT architecture; integrates multi-granularity text-alignment training with long-video data distillation; and introduces LongTake-HD, a dataset built for long-video generation (261k content-rich videos, each annotated with an overall caption and five progressive sub-captions). On VBench, Presto reaches a Semantic Score of 78.5% and a Dynamic Degree of 100%, which the authors report as outperforming existing state-of-the-art video generation methods. The results point to gains in content richness, long-range coherence, and the capture of fine-grained textual detail.

📝 Abstract

We introduce Presto, a novel video diffusion model designed to generate 15-second videos with long-range coherence and rich content. Extending video generation methods to maintain scenario diversity over long durations presents significant challenges. To address this, we propose a Segmented Cross-Attention (SCA) strategy, which splits hidden states into segments along the temporal dimension, allowing each segment to cross-attend to a corresponding sub-caption. SCA requires no additional parameters, enabling seamless incorporation into current DiT-based architectures. To facilitate high-quality long video generation, we build the LongTake-HD dataset, consisting of 261k content-rich videos with scenario coherence, annotated with an overall video caption and five progressive sub-captions. Experiments show that our Presto achieves 78.5% on the VBench Semantic Score and 100% on the Dynamic Degree, outperforming existing state-of-the-art video generation methods. This demonstrates that our proposed Presto significantly enhances content richness, maintains long-range coherence, and captures intricate textual details. More details are displayed on our project page: https://presto-video.github.io/.
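
The segmentation idea in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the non-overlapping, near-equal split of frames, the helper names (`segment_bounds`, `cross_attend`, `segmented_cross_attention`), and the omission of learned query/key/value projections are all assumptions made for illustration. In a real DiT block the existing cross-attention weights would presumably be reused for every segment, which is consistent with the claim that SCA adds no parameters.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention with keys = values = context.

    Learned projections are omitted here (an illustrative simplification).
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        )
    return outputs


def segment_bounds(num_frames, num_segments):
    """Split range(num_frames) into contiguous, near-equal spans (assumed scheme)."""
    return [
        (k * num_frames // num_segments, (k + 1) * num_frames // num_segments)
        for k in range(num_segments)
    ]


def segmented_cross_attention(hidden, sub_captions):
    """Let each temporal segment attend only to its own sub-caption.

    hidden:       list of frames, each a list of token vectors
    sub_captions: list of text-embedding sequences, one per segment
    """
    bounds = segment_bounds(len(hidden), len(sub_captions))
    output = []
    for (start, end), caption in zip(bounds, sub_captions):
        for frame in hidden[start:end]:
            output.append(cross_attend(frame, caption))
    return output


# Toy input: 10 frames x 2 tokens x 2 dims, and 5 one-token sub-captions.
hidden = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(10)]
sub_captions = [[[float(k), float(-k)]] for k in range(5)]
result = segmented_cross_attention(hidden, sub_captions)
```

Because each toy sub-caption holds a single token, every attention weight is 1, so the output for a frame is simply the embedding of the sub-caption assigned to that frame's segment. That makes the frame-to-caption routing easy to inspect.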

Problem

Research questions and friction points this paper is trying to address.

Generate long coherent videos with rich content

Maintain scenario diversity over extended durations

Enhance textual detail capture in video generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Segmented Cross-Attention for temporal coherence

Content-rich LongTake-HD video dataset

Parameter-free SCA that integrates into existing DiT-based architectures
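
As a rough illustration of the annotation layout described for LongTake-HD (one overall caption plus five progressive sub-captions per video), the record below is a hypothetical schema. The class name, field names, and the equal-length time spans used by `caption_for_time` are assumptions, not the dataset's published format.

```python
from dataclasses import dataclass
from typing import List

# The source describes five progressive sub-captions per video.
NUM_SUB_CAPTIONS = 5


@dataclass
class LongTakeSample:
    """Hypothetical record; field names are illustrative only."""

    video_path: str
    overall_caption: str
    sub_captions: List[str]

    def __post_init__(self):
        # Reject records that do not carry exactly five sub-captions.
        if len(self.sub_captions) != NUM_SUB_CAPTIONS:
            raise ValueError(
                f"expected {NUM_SUB_CAPTIONS} sub-captions, got {len(self.sub_captions)}"
            )
        # Reject empty or whitespace-only captions.
        captions = [self.overall_caption, *self.sub_captions]
        if any(not caption.strip() for caption in captions):
            raise ValueError("captions must be non-empty")

    def caption_for_time(self, t_seconds: float, duration_seconds: float) -> str:
        """Return the sub-caption covering time t, assuming equal-length spans."""
        if not 0 <= t_seconds < duration_seconds:
            raise ValueError("t_seconds must lie in [0, duration_seconds)")
        index = int(t_seconds * NUM_SUB_CAPTIONS / duration_seconds)
        return self.sub_captions[index]


sample = LongTakeSample(
    video_path="clip_0001.mp4",  # placeholder path
    overall_caption="A hiker crosses a valley from dawn to dusk.",
    sub_captions=[
        "Dawn light over the valley.",
        "The hiker starts along the trail.",
        "The trail climbs past a stream.",
        "The hiker reaches a ridge.",
        "Dusk settles over the peaks.",
    ],
)
```

For a 15-second clip under this assumed equal split, each sub-caption would cover a 3-second span, so `sample.caption_for_time(3.0, 15.0)` selects the second sub-caption.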

🔎 Similar Papers

Latte: Latent Diffusion Transformer for Video Generation

2024-01-05 · arXiv.org · Citations: 204

Real-Time Video Generation with Pyramid Attention Broadcast

2024-08-22 · arXiv.org · Citations: 16