Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

📅 2026-04-11

🤖 AI Summary
This work addresses two problems in streaming video generation: sliding window attention struggles to capture long-range dependencies, and it incurs high computational cost. To overcome these limitations, the authors propose a hybrid attention mechanism that uses lightweight linear temporal attention to preserve distant historical context and block-sparse sliding window attention to reduce redundant local computation. The method employs a decoupled distillation strategy to train an autoregressive video diffusion model in stages and maintains incremental key-value states for efficient inference. Evaluated on both long and short video generation benchmarks, the approach achieves state-of-the-art performance, enabling real-time, unbounded video synthesis at 832×480 resolution with 29.5 FPS on a single H100 GPU, without requiring quantization or compression.
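To make the "incremental key-value state" idea concrete, here is a minimal sketch of how tokens evicted from a sliding window could be folded into a fixed-size linear-attention summary. It is an illustration under stated assumptions, not the authors' implementation: the `elu(x) + 1` feature map, the class names `EvictionState` and `StreamingCache`, and the per-token (rather than per-frame) eviction are all hypothetical choices. How the summary readout is merged with the local window attention is not specified in the text above and is left out.

```python
import math
from collections import deque


def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed choice, not taken from the paper)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class EvictionState:
    """Fixed-size summary of every token that has left the sliding window."""

    def __init__(self, d_k, d_v):
        self.S = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
        self.z = [0.0] * d_k                        # running sum of phi(k)

    def absorb(self, k, v):
        """Fold one evicted (key, value) pair into the state in O(d_k * d_v)."""
        fk = feature_map(k)
        for i, fki in enumerate(fk):
            self.z[i] += fki
            for j, vj in enumerate(v):
                self.S[i][j] += fki * vj

    def read(self, q):
        """Linear-attention readout: phi(q)^T S / phi(q)^T z."""
        fq = feature_map(q)
        d_v = len(self.S[0])
        denom = sum(a * b for a, b in zip(fq, self.z))
        if denom == 0.0:  # nothing has been evicted yet
            return [0.0] * d_v
        return [
            sum(fq[i] * self.S[i][j] for i in range(len(fq))) / denom
            for j in range(d_v)
        ]


class StreamingCache:
    """Sliding window of recent (k, v) pairs plus the compact evicted-token state."""

    def __init__(self, window, d_k, d_v):
        self.window = window
        self.local = deque()
        self.state = EvictionState(d_k, d_v)

    def append(self, k, v):
        self.local.append((k, v))
        while len(self.local) > self.window:
            old_k, old_v = self.local.popleft()
            self.state.absorb(old_k, old_v)  # history is summarized, not discarded
```

The state size depends only on `d_k` and `d_v`, not on how many tokens have been evicted, which is why memory stays constant as the video grows.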

πŸ“ Abstract
Streaming video generation (SVG) distills a pretrained bidirectional video diffusion model into an autoregressive model equipped with sliding window attention (SWA). However, SWA inevitably loses distant history during long video generation, and its computational overhead remains a critical challenge to real-time deployment. In this work, we propose Hybrid Forcing, which jointly optimizes temporal information retention and computational efficiency through a hybrid attention design. First, we introduce lightweight linear temporal attention to preserve long-range dependencies beyond the sliding window. In particular, we maintain a compact key-value state to incrementally absorb evicted tokens, retaining temporal context with negligible memory and computational overhead. Second, we incorporate block-sparse attention into the local sliding window to reduce redundant computation within short-range modeling, reallocating computational capacity toward more critical dependencies. Finally, we introduce a decoupled distillation strategy tailored to the hybrid attention design. A few-step initial distillation is performed under dense attention, then the distillation of our proposed linear temporal and block-sparse attention is activated for streaming modeling, ensuring stable optimization. Extensive experiments on both short- and long-form video generation benchmarks demonstrate that Hybrid Forcing consistently achieves state-of-the-art performance. Notably, our model achieves real-time, unbounded 832x480 video generation at 29.5 FPS on a single NVIDIA H100 GPU without quantization or model compression. The source code and trained models are available at https://github.com/leeruibin/hybrid-forcing.
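The block-sparse step inside the local window can be illustrated with a small sketch. The abstract does not say how blocks are selected, so the rule below is an assumption: queries and keys are split into fixed-size blocks, each block is mean-pooled, and every query block keeps only its `keep` highest-scoring key blocks. The function names `mean_pool` and `select_key_blocks` are hypothetical, and all keys are assumed to lie inside the current sliding window.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors into one representative vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_key_blocks(queries, keys, block_size, keep):
    """For each query block, return the sorted indices of the key blocks it keeps.

    Selection rule (an assumption, not the paper's): score each
    (query block, key block) pair by the dot product of their mean-pooled
    vectors and keep the `keep` best key blocks per query block.
    """
    def split(xs):
        return [xs[i:i + block_size] for i in range(0, len(xs), block_size)]

    q_blocks = [mean_pool(b) for b in split(queries)]
    k_blocks = [mean_pool(b) for b in split(keys)]

    selected = []
    for qb in q_blocks:
        scores = [sum(a * b for a, b in zip(qb, kb)) for kb in k_blocks]
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        selected.append(sorted(ranked[:keep]))
    return selected
```

Dense attention would then be computed only for the selected block pairs, so the cost scales with `keep` rather than with the number of key blocks in the window.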
Problem

Research questions and friction points this paper is trying to address.

Streaming Video Generation
Long-Horizon Video
Sliding Window Attention
Computational Efficiency
Temporal Context
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Attention
Streaming Video Generation
Decoupled Distillation
Linear Temporal Attention
Block-Sparse Attention