Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

📅 2026-04-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems in streaming video generation: sliding window attention loses long-range dependencies, and its computational cost remains high. The authors propose a hybrid attention mechanism that adds lightweight linear temporal attention to preserve distant historical context and applies block-sparse attention inside the sliding window to cut redundant local computation. A decoupled distillation strategy trains the autoregressive video diffusion model in stages, and incremental key-value states keep inference efficient. On both long and short video generation benchmarks, the approach reports state-of-the-art performance and real-time, unbounded video synthesis at 832×480 resolution and 29.5 FPS on a single H100 GPU, without quantization or model compression.
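To make the block-sparse idea concrete, here is a minimal sketch of selecting which key blocks each query block attends to inside a causal sliding window. It is an illustration only: the function name `select_blocks`, the `scores` matrix, and the keep-the-top-scoring-blocks rule are all assumptions, since the summary does not say how the paper chooses its sparse blocks.

```python
def select_blocks(scores, window_blocks, keep):
    """Pick key blocks for each query block (hypothetical selection rule).

    scores[i][j]  : assumed importance of key block j for query block i
    window_blocks : sliding-window width, measured in blocks
    keep          : maximum number of key blocks kept per query block
    """
    kept = []
    for i in range(len(scores)):
        # Causal sliding window: only blocks in [i - window_blocks + 1, i].
        lo = max(0, i - window_blocks + 1)
        candidates = list(range(lo, i))  # in-window past blocks, excluding i
        # Rank past blocks by score; the query's own block is always kept.
        candidates.sort(key=lambda j: scores[i][j], reverse=True)
        chosen = sorted(candidates[:max(0, keep - 1)] + [i])
        kept.append(chosen)
    return kept


scores = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.1, 0.9, 1.0, 0.0],
    [9.0, 0.2, 0.7, 1.0],  # block 0 scores high but lies outside the window
]
layout = select_blocks(scores, window_blocks=3, keep=2)
```

With `window_blocks=3` and `keep=2`, each query block computes attention over at most two key blocks instead of three, which is where the savings in local computation would come from under these assumptions.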

📝 Abstract

Streaming video generation (SVG) distills a pretrained bidirectional video diffusion model into an autoregressive model equipped with sliding window attention (SWA). However, SWA inevitably loses distant history during long video generation, and its computational overhead remains a critical challenge to real-time deployment. In this work, we propose Hybrid Forcing, which jointly optimizes temporal information retention and computational efficiency through a hybrid attention design. First, we introduce lightweight linear temporal attention to preserve long-range dependencies beyond the sliding window. In particular, we maintain a compact key-value state to incrementally absorb evicted tokens, retaining temporal context with negligible memory and computational overhead. Second, we incorporate block-sparse attention into the local sliding window to reduce redundant computation within short-range modeling, reallocating computational capacity toward more critical dependencies. Finally, we introduce a decoupled distillation strategy tailored to the hybrid attention design. A few-step initial distillation is first performed under dense attention; the distillation of our proposed linear temporal and block-sparse attention is then activated for streaming modeling, ensuring stable optimization. Extensive experiments on both short- and long-form video generation benchmarks demonstrate that Hybrid Forcing consistently achieves state-of-the-art performance. Notably, our model achieves real-time, unbounded 832×480 video generation at 29.5 FPS on a single NVIDIA H100 GPU without quantization or model compression. The source code and trained models are available at https://github.com/leeruibin/hybrid-forcing.
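The "compact key-value state" that absorbs evicted tokens can be illustrated with a generic linear-attention recurrence. The sketch below is not the authors' implementation: the class name `HybridCache`, the `elu(x) + 1` feature map, and the normalized read-out are assumptions borrowed from standard linear attention, and how the paper fuses the window output with the history output is not covered here.

```python
import math
from collections import deque


def feature_map(x):
    # elu(x) + 1: a positive feature map often used in linear attention
    # (an assumption; the abstract does not name the kernel).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


class HybridCache:
    """Sliding window of recent (k, v) pairs plus a fixed-size linear state."""

    def __init__(self, window, dim_k, dim_v):
        self.window = window
        self.recent = deque()  # tokens still inside the window
        self.S = [[0.0] * dim_v for _ in range(dim_k)]  # sum of phi(k) v^T
        self.z = [0.0] * dim_k  # sum of phi(k), used for normalization

    def append(self, k, v):
        self.recent.append((k, v))
        # Tokens pushed out of the window are folded into the state
        # instead of being discarded.
        while len(self.recent) > self.window:
            self._absorb(*self.recent.popleft())

    def _absorb(self, k, v):
        for i, p in enumerate(feature_map(k)):
            self.z[i] += p
            for j, vj in enumerate(v):
                self.S[i][j] += p * vj

    def read_history(self, q):
        # Linear-attention read-out over all evicted tokens.
        phi = feature_map(q)
        dim_v = len(self.S[0])
        denom = sum(p * zi for p, zi in zip(phi, self.z))
        if denom == 0.0:
            return [0.0] * dim_v  # nothing has been evicted yet
        return [
            sum(phi[i] * self.S[i][j] for i in range(len(phi))) / denom
            for j in range(dim_v)
        ]


cache = HybridCache(window=2, dim_k=2, dim_v=2)
cache.append([1.0, 0.0], [2.0, 3.0])
cache.append([0.0, 1.0], [5.0, 5.0])
cache.append([1.0, 1.0], [7.0, 7.0])  # evicts the first token into the state
history = cache.read_history([0.5, -0.5])
```

In this sketch the state `S` and normalizer `z` have a fixed size regardless of how many tokens have been evicted, which matches the abstract's claim of negligible memory overhead for retained history.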

Problem

Research questions and friction points this paper is trying to address.

Streaming Video Generation

Long-Horizon Video

Sliding Window Attention

Computational Efficiency

Temporal Context

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid Attention

Streaming Video Generation

Decoupled Distillation