VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping

📅 2025-11-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Visual autoregressive (AR) image generation suffers from high inference latency due to sequential token-by-token prediction. This work addresses the inherent "draft one step, verify one step" bottleneck of speculative decoding for visual AR models by introducing, for the first time, a **partial verification skipping mechanism** that breaks the conventional requirement of verifying every drafted token. The core contributions are: (1) a dynamic, verification-free token selection strategy that leverages visual token interchangeability and adaptive truncation; (2) token-level feature caching and reuse to avoid redundant computation; and (3) fine-grained skip scheduling that adaptively decides which steps go unverified. The method reduces target model forward passes by 2.8× relative to standard AR decoding while preserving generation quality, achieving a superior speed-quality trade-off without compromising fidelity or diversity.

📝 Abstract
Visual autoregressive (AR) generation models have demonstrated strong potential for image generation, yet their next-token-prediction paradigm introduces considerable inference latency. Although speculative decoding (SD) has proven effective for accelerating visual AR models, its "draft one step, then verify one step" paradigm prevents a direct reduction of the forward passes, thus restricting acceleration potential. Motivated by visual token interchangeability, we are the first to explore verification skipping in the SD process of visual AR model generation to explicitly cut the number of target model forward passes, thereby reducing inference latency. Based on an analysis of the drafting stage's characteristics, we observe that verification redundancy and stale feature reusability are key factors in retaining generation quality and speedup for verification-free steps. Inspired by these two observations, we propose a novel SD framework, VVS, that accelerates visual AR generation via partial verification skipping and integrates three complementary modules: (1) a verification-free token selector with dynamic truncation, (2) token-level feature caching and reuse, and (3) fine-grained skipped step scheduling. Consequently, VVS reduces the number of target model forward passes by a factor of $2.8\times$ relative to vanilla AR decoding while maintaining competitive generation quality, offering a superior speed-quality trade-off over conventional SD frameworks and revealing strong potential to reshape the SD paradigm.
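The abstract's central idea can be illustrated with a toy decode loop: ordinary SD calls the target model once per drafted step, whereas partial verification skipping accepts some draft tokens without any target forward pass. The sketch below is illustrative only, assuming a fixed skip interval in place of the paper's adaptive scheduler; `draft_step` and `verify_step` are hypothetical stand-ins for the draft and target models.

```python
def speculative_decode_with_skipping(draft_step, verify_step, n_tokens, skip_interval=3):
    """Toy speculative decoding loop with partial verification skipping.

    Every `skip_interval`-th draft step is accepted verification-free,
    i.e. without a target model forward pass. This is a simplified
    sketch, not the paper's actual VVS scheduler.
    """
    tokens = []
    target_calls = 0  # counts target model forward passes
    for step in range(n_tokens):
        draft = draft_step(step)
        if skip_interval and (step + 1) % skip_interval == 0:
            # Verification-free step: trust the draft token outright.
            tokens.append(draft)
        else:
            # Verified step: the target model checks (and may correct) the draft.
            tokens.append(verify_step(step, draft))
            target_calls += 1
    return tokens, target_calls
```

With `n_tokens=12` and `skip_interval=3`, every third step is skipped, so only 8 of 12 steps invoke the target model; the paper's adaptive scheduling pushes this further, to an overall 2.8× reduction versus vanilla AR decoding.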
Problem

Research questions and friction points this paper is trying to address.

Reducing inference latency in visual autoregressive image generation models
Overcoming limitations of speculative decoding by skipping verification steps
Maintaining generation quality while cutting target model forward passes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Partial verification skipping for visual AR models
Token-level feature caching and reuse mechanism
Fine-grained skipped step scheduling optimization
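The second innovation, token-level feature caching and reuse, amounts to memoizing per-token features computed on verified steps so that verification-free steps can reuse them instead of recomputing. A minimal sketch, with hypothetical names (`TokenFeatureCache`, `get_or_compute` are not the paper's API):

```python
class TokenFeatureCache:
    """Illustrative token-level feature cache: features computed once are
    reused on later (e.g. verification-free) steps rather than recomputed."""

    def __init__(self):
        self._store = {}
        self.hits = 0    # reuses of a cached feature
        self.misses = 0  # fresh computations

    def get_or_compute(self, token_id, compute_fn):
        if token_id in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[token_id] = compute_fn(token_id)
        return self._store[token_id]
```

Each cache hit removes one feature computation; the paper's version additionally handles the staleness of reused features, which this sketch omits.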
Authors
Haotian Dong, Tsinghua University
Ye Li, Tsinghua University
Rongwei Lu, Tsinghua University (distributed machine learning, gradient compression, federated learning)
Chen Tang, The Chinese University of Hong Kong
Shu-Tao Xia, SIGS, Tsinghua University (coding and information theory, machine learning, computer vision, AI security)
Zhi Wang, Tsinghua University