SparVAR: Exploring Sparsity in Visual AutoRegressive Modeling for Training-Free Acceleration

📅 2026-02-04
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the prohibitive computational cost of full self-attention in visual autoregressive (VAR) models for high-resolution image generation, where complexity scales quartically with resolution and existing acceleration methods often sacrifice high-frequency detail. The authors identify three sparsity properties inherent to VAR attention: strong attention sinks, cross-scale activation similarity, and pronounced locality. Leveraging these insights, they propose a training-free dynamic sparse attention framework that predicts sparse attention patterns at high-resolution scales via cross-scale index mapping and implements an efficient block-sparse attention kernel. Their method brings inference time for 1024×1024 image generation with an 8B-parameter model to under one second, a 1.57× speed-up over a FlashAttention-accelerated baseline, with negligible loss in high-frequency fidelity. Further integration of a scale-skipping strategy achieves a 2.28× speed-up while preserving high-fidelity generation quality.

📝 Abstract
Visual AutoRegressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction paradigm. However, mainstream VAR paradigms attend to all tokens across historical scales at each autoregressive step. As the next-scale resolution grows, the computational complexity of attention increases quartically with resolution, causing substantial latency. Prior acceleration methods often skip high-resolution scales, which speeds up inference but discards high-frequency details and harms image quality. To address these problems, we present SparVAR, a training-free acceleration framework that exploits three properties of VAR attention: (i) strong attention sinks, (ii) cross-scale activation similarity, and (iii) pronounced locality. Specifically, we dynamically predict the sparse attention pattern of later high-resolution scales from a sparse decision scale, and construct scale self-similar sparse attention via an efficient index-mapping mechanism, enabling high-efficiency sparse attention computation at large scales. Furthermore, we propose cross-scale local sparse attention and implement an efficient block-wise sparse kernel, which achieves $\mathbf{>5\times}$ faster forward speed than FlashAttention. Extensive experiments demonstrate that the proposed SparVAR can reduce the generation time of an 8B model producing $1024\times1024$ high-resolution images to 1 s, without skipping the last scales. Compared with the VAR baseline accelerated by FlashAttention, our method achieves a $\mathbf{1.57\times}$ speed-up while preserving almost all high-frequency details. When combined with existing scale-skipping strategies, SparVAR attains up to a $\mathbf{2.28\times}$ acceleration, while maintaining competitive visual generation quality. Code is available at https://github.com/CAS-CLab/SparVAR.
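The cross-scale prediction idea in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' kernel: the function names, the quantile-based block selection rule, and all tensor sizes are assumptions, and the actual SparVAR implementation uses a fused block-sparse GPU kernel rather than the dense masked softmax shown here.

```python
# Sketch of the idea: estimate a block-level sparse attention mask once at a
# small "decision" scale, map it to a larger scale by index upsampling, then
# run attention only where the mask allows. All names/sizes are illustrative.
import numpy as np

def decision_scale_mask(q_blocks, k_blocks, keep_ratio=0.25):
    """Score block pairs at the small scale and keep the top fraction.
    The diagonal is always kept so every query block attends somewhere."""
    scores = q_blocks @ k_blocks.T / np.sqrt(q_blocks.shape[-1])
    thresh = np.quantile(scores, 1.0 - keep_ratio)
    mask = scores >= thresh
    mask |= np.eye(mask.shape[0], dtype=bool)
    return mask  # boolean (n_blocks, n_blocks)

def upsample_mask(mask, factor):
    """Index-map the block mask from the decision scale to a larger scale:
    each small-scale block expands into a factor x factor tile of blocks."""
    return np.kron(mask, np.ones((factor, factor), dtype=bool))

def block_sparse_attention(q, k, v, block_mask, block=4):
    """Softmax attention where masked-out key blocks are excluded.
    A real kernel would skip the masked blocks entirely; here we just
    mask them with -inf for clarity."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    token_mask = np.kron(block_mask, np.ones((block, block), dtype=bool))
    scores = np.where(token_mask, scores, -np.inf)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

In this toy version the masked positions are still computed and then discarded; the speed-up in the paper comes from a block-wise kernel that never touches the masked blocks, so the sketch captures the mask construction and index mapping but not the efficiency.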
Problem

Research questions and friction points this paper is trying to address.

Visual AutoRegressive
computational complexity
high-resolution image generation
attention mechanism
inference latency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sparse Attention
Visual Autoregressive Modeling
Training-Free Acceleration
Cross-Scale Similarity
Block-wise Sparse Kernel
👥 Authors
Zekun Li (Institute of Automation, Chinese Academy of Sciences)
Ning Wang (Institute of Automation, Chinese Academy of Sciences)
Tongxin Bai (BAAI)
Changwang Mei (Institute of Automation, Chinese Academy of Sciences)
Peisong Wang (CASIA)
Shuang Qiu (City University of Hong Kong)
Jian Cheng (Institute of Automation, Chinese Academy of Sciences)