Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

📅 2026-02-04
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the computational bottleneck in autoregressive video generation caused by the quadratic complexity of standard attention, which severely limits inference efficiency, while existing sparse attention methods often compromise generation quality. To overcome this, the authors propose the first sparse attention scheme tailored to autoregressive video diffusion models, featuring a chunk-aware growth mechanism that dynamically allocates sparsity across chunks and a hierarchical sparse attention design with frame-level and block-level masks that jointly model multi-granular historical and local context. Integrated with FP8 quantization and LightVAE, the approach achieves a VBench score of 84.5, delivers a 1.2–1.3× end-to-end speedup, and attains 19.7 FPS (2.3× faster) on an RTX 5090 GPU.
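The summary describes a mechanism that scores each chunk's contribution and allocates sparsity accordingly. The paper's exact estimator is not given here; the sketch below is a minimal, hypothetical stand-in that assumes contribution scores are already available and simply converts them into proportional per-chunk keep-budgets (the inverse of sparsity), with rounding remainders given to the highest-scoring chunks.

```python
import numpy as np

def allocate_chunk_budgets(contrib_scores, total_budget):
    """Split a global token keep-budget across chunks in proportion to
    each chunk's estimated contribution score.

    Hypothetical illustration of contribution-driven sparsity
    allocation; not the paper's actual Chunk-Aware Growth estimator.
    """
    scores = np.asarray(contrib_scores, dtype=np.float64)
    weights = scores / scores.sum()
    # floor first so we never exceed the budget, then hand out the
    # rounding remainder to the most important chunks
    budgets = np.floor(weights * total_budget).astype(int)
    remainder = total_budget - budgets.sum()
    for idx in np.argsort(-weights)[:remainder]:
        budgets[idx] += 1
    return budgets
```

Under this scheme, later chunks that draw on richer history would receive larger scores and therefore denser attention, matching the summary's "progressive" allocation.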

πŸ“ Abstract
Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying them to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose Light Forcing, the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism to quantitatively estimate the contribution of each chunk, which determines its sparsity allocation. This progressive sparsity-increase strategy enables the current chunk to inherit prior knowledge from earlier chunks during generation. Additionally, we introduce a Hierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner. Such a two-level mask selection strategy (i.e., frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention approaches in quality (e.g., 84.5 on VBench) and efficiency (e.g., 1.2–1.3× end-to-end speedup). Combined with FP8 quantization and LightVAE, Light Forcing further achieves a 2.3× speedup and 19.7 FPS on an RTX 5090 GPU. Code will be released at https://github.com/chengtao-lv/LightForcing.
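The abstract's coarse-to-fine, two-level mask selection can be sketched as follows. This is a minimal illustration under assumed conventions (per-block importance scores as input, mean-pooling for the frame level, top-k selection at both levels); the paper's actual scoring and selection rules may differ.

```python
import numpy as np

def hierarchical_sparse_mask(block_scores, frames, blocks_per_frame,
                             frame_topk, block_topk):
    """Two-level (frame, then block) sparse-attention mask selection.

    block_scores: flat per-key-block importance scores,
                  length frames * blocks_per_frame (assumed given).
    Returns a flat boolean keep-mask over key blocks.
    """
    scores = np.asarray(block_scores, dtype=np.float64)
    scores = scores.reshape(frames, blocks_per_frame)

    # Coarse level: rank frames by pooled (mean) block importance
    # and keep only the top frame_topk frames.
    frame_scores = scores.mean(axis=1)
    keep_frames = np.argsort(-frame_scores)[:frame_topk]

    # Fine level: within each kept frame, keep its top block_topk blocks.
    mask = np.zeros((frames, blocks_per_frame), dtype=bool)
    for f in keep_frames:
        keep_blocks = np.argsort(-scores[f])[:block_topk]
        mask[f, keep_blocks] = True
    return mask.reshape(-1)
```

The coarse pass prunes whole frames of history cheaply, while the fine pass preserves the most informative blocks inside the frames that survive, which is one way such a mask could "adaptively handle diverse attention patterns".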
Problem

Research questions and friction points this paper is trying to address.

autoregressive video generation
sparse attention
attention complexity
video diffusion
context utilization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Attention
Autoregressive Video Generation
Chunk-Aware Growth
Hierarchical Sparse Attention
Light Forcing