SAGE: Accelerating Vision-Language Models via Entropy-Guided Adaptive Speculative Decoding

📅 2026-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing speculative decoding methods for vision-language models typically employ static tree structures, which struggle to adapt to the varying prediction difficulty across generation steps, limiting both accepted sequence length and acceleration gains. To address this, this work proposes SAGE, a novel framework that dynamically adjusts the speculation tree structure using output entropy as a confidence metric: deep, narrow trees are constructed under high-confidence conditions to maximize speculation depth, while shallow, wide trees are used under low-confidence conditions to enhance candidate diversity. By integrating adaptive tree construction with a parallel multi-token verification mechanism, SAGE significantly improves inference efficiency. Experiments demonstrate speedups of up to 3.36× and 3.18× on LLaVA-OneVision-72B and Qwen2.5-VL-72B, respectively, without compromising output quality.

📝 Abstract
Speculative decoding has emerged as a promising approach to accelerate inference in vision-language models (VLMs) by enabling parallel verification of multiple draft tokens. However, existing methods rely on static tree structures that remain fixed throughout the decoding process, failing to adapt to the varying prediction difficulty across generation steps. This leads to suboptimal acceptance lengths and limited speedup. In this paper, we propose SAGE, a novel framework that dynamically adjusts the speculation tree structure based on real-time prediction uncertainty. Our key insight is that output entropy serves as a natural confidence indicator with strong temporal correlation across decoding steps. SAGE constructs deeper-narrower trees for high-confidence predictions to maximize speculation depth, and shallower-wider trees for uncertain predictions to diversify exploration. SAGE improves acceptance lengths and achieves faster acceleration compared to static tree baselines. Experiments on multiple benchmarks demonstrate the effectiveness of SAGE: without any loss in output quality, it delivers up to $3.36\times$ decoding speedup for LLaVA-OneVision-72B and $3.18\times$ for Qwen2.5-VL-72B.
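The abstract's core mechanism — reading output entropy as a confidence signal and switching between deep-narrow and shallow-wide speculation trees — can be sketched as follows. This is an illustrative toy, not the paper's implementation: the entropy threshold and the `(depth, width)` shapes are assumed values chosen for demonstration only.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def choose_tree_shape(probs, threshold=1.0,
                      confident_shape=(6, 2), uncertain_shape=(3, 5)):
    """Pick a (depth, width) for the speculation tree from output entropy.

    Low entropy (confident prediction) -> deep, narrow tree to maximize
    speculation depth; high entropy (uncertain prediction) -> shallow, wide
    tree to diversify exploration. The threshold and shapes here are
    hypothetical placeholders, not values from the paper.
    """
    h = token_entropy(probs)
    return confident_shape if h < threshold else uncertain_shape

# A peaked (confident) distribution selects the deep-narrow shape:
print(choose_tree_shape([0.97, 0.01, 0.01, 0.01]))  # (6, 2)
# A flat (uncertain) distribution selects the shallow-wide shape:
print(choose_tree_shape([0.25, 0.25, 0.25, 0.25]))  # (3, 5)
```

In the actual framework this decision would be made at each decoding step, exploiting the strong temporal correlation of entropy across steps that the authors report.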
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
vision-language models
adaptive tree structure
prediction uncertainty
decoding acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
entropy-guided adaptation
vision-language models
dynamic tree structure
inference acceleration
Yujia Tong
Wuhan University of Technology — Machine Learning · Efficient Computing
Tian Zhang
School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Hubei 430070, China
Yunyang Wan
School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Hubei 430070, China
Kaiwei Lin
School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Hubei 430070, China
Jingling Yuan
School of Computer Science and Artificial Intelligence, Wuhan University of Technology, Hubei 430070, China
Chuang Hu
School of Computer Science, Wuhan University, Hubei 430072, China