LoViC: Efficient Long Video Generation with Context Compression

๐Ÿ“… 2025-07-17
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
To address the high computational complexity of self-attention, weak temporal coherence, and limited scalability in diffusion transformers (DiTs) for long-video generation, this paper proposes FlexFormer. Methodologically, FlexFormer integrates single-query attention with position-aware mechanisms to support variable-length inputs and linearly adjustable compression ratios; employs a Q-Former variational autoencoder, sparse attention, and context compression for joint video-text latent-space modeling; and adopts a segment-wise generation scheme with unified latent representation to enable efficient modeling of million-scale open-domain videos. Experiments demonstrate significant improvements in generation efficiency and temporal consistency across prediction, retroactive generation, interpolation, and multi-shot synthesis tasks. FlexFormer achieves strong generalization across diverse video domains while maintaining practical deployability.

Technology Category

Application Category

๐Ÿ“ Abstract
Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts -- such as sparse attention and temporally autoregressive models -- offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single query token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retradiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.
Problem

Research questions and friction points this paper is trying to address.

Scaling diffusion transformers for long video generation
Maintaining temporal coherence in autoregressive video models
Compressing video and text into unified latent representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

FlexFormer compresses video and text jointly
Single query token enables adjustable compression
Position-aware mechanisms encode temporal context