FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

📅 2026-05-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses the limitations of existing speculative decoding methods in large-batch inference. In parallel speculative decoding, the bonus token and the accepted length are still unknown while the next draft is being prepared, so drafts and verification results can mismatch, and throughput gains collapse at large batch sizes. Existing parallel methods also either rely on costly continual pretraining that degrades generation quality or suffer from low acceptance rates. To overcome these challenges, the authors propose FlexDraft, a lossless framework that enables block diffusion drafting by tuning only the attention projectors of the final few layers while keeping the autoregressive path frozen. It further introduces a lightweight MLP that calibrates draft logits based on the resolved bonus token, and a decoding scheme that switches between parallel draft-and-verify at small batch sizes and sequential draft-then-verify at large batch sizes while adjusting verification length according to draft confidence. FlexDraft achieves substantial throughput gains in large-batch settings while preserving generation quality.
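To make the two uncertain quantities concrete, here is a minimal sketch of one greedy verification step in generic speculative decoding. It is not FlexDraft's implementation. The function name `verify_greedy` and its inputs are hypothetical, and using "bonus token" for the target model's token at the first rejected position as well as after a fully accepted draft is an assumption about terminology.

```python
from typing import List, Tuple


def verify_greedy(draft: List[int], target_argmax: List[int]) -> Tuple[int, int]:
    """Hypothetical helper: one greedy verification step.

    draft:         k token ids proposed by the drafter.
    target_argmax: k + 1 token ids, where target_argmax[i] is the target
                   model's greedy choice given the prefix plus draft[:i].
    Returns (accepted_length, bonus_token).
    """
    if len(target_argmax) != len(draft) + 1:
        raise ValueError("target_argmax must have one more entry than draft")

    accepted = 0
    for drafted, verified in zip(draft, target_argmax):
        if drafted != verified:
            break  # first disagreement ends the accepted prefix
        accepted += 1

    # The target's own token at the stopping position is always kept.
    # Neither `accepted` nor `bonus` is known before verification finishes.
    bonus = target_argmax[accepted]
    return accepted, bonus


if __name__ == "__main__":
    print(verify_greedy([5, 7, 9], [5, 7, 2, 4]))
```

Both return values depend on the target model's output, so a draft prepared during the same forward pass cannot know where the next block should start or which token precedes it.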
📝 Abstract
Speculative decoding accelerates memory-bound LLM inference without quality degradation by using a fast drafter to propose multiple candidate tokens and the target model to verify them in parallel. However, conventional sequential speculative decoding suffers from mutual waiting between drafting and verification, and repeated exchange of intermediate states further increases memory access overhead. Parallel speculative decoding addresses this limitation by performing drafting and verification within a single target forward pass, allowing future drafts to be prepared while current candidates are being verified. Although effective at small batch sizes, existing parallel speculative decoding methods either require costly continual pretraining with quality degradation or suffer from low acceptance rates. More importantly, this paradigm inherently suffers from uncertainty in both the bonus token and the accepted length, leading to draft-verification mismatch and causing throughput gains to collapse at large batch sizes. To address these limitations, we introduce FlexDraft, a lossless speculative decoding framework that flexibly adapts to varying batch sizes through three key designs. (1) Attention Tuning enables block diffusion drafting by tuning only the attention projectors of the final few layers on mask tokens, while keeping the autoregressive path frozen to preserve the target distribution and produce high-quality drafts with minimal trainable parameters. (2) Bonus-guided Calibration uses a lightweight MLP conditioned on the resolved bonus token to calibrate draft logits, mitigating draft-verification mismatch caused by bonus token uncertainty. (3) Flex Decoding dynamically switches between parallel draft-and-verify at small batch sizes and sequential draft-then-verify at large batch sizes, and adjusts verification length based on draft confidence to eliminate redundant computation.
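The switching behaviour described in design (3) can be sketched as a small policy. This is only an illustration under stated assumptions: the abstract does not give the actual switching rule or confidence criterion, so the batch-size cutoff `parallel_max_batch`, the confidence threshold `min_conf`, and both function names are hypothetical.

```python
from typing import List


def choose_mode(batch_size: int, parallel_max_batch: int = 8) -> str:
    """Pick a decoding mode from the batch size.

    `parallel_max_batch` is a made-up cutoff for illustration only.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    return "parallel" if batch_size <= parallel_max_batch else "sequential"


def verify_length(confidences: List[float], min_conf: float = 0.5) -> int:
    """Number of leading draft tokens worth sending to verification.

    Stops at the first draft token whose confidence falls below the
    hypothetical threshold `min_conf`; later tokens are skipped because
    they would be discarded anyway if this one is rejected.
    """
    length = 0
    for conf in confidences:
        if conf < min_conf:
            break
        length += 1
    return length


if __name__ == "__main__":
    print(choose_mode(4), choose_mode(64))
    print(verify_length([0.9, 0.8, 0.3, 0.95]))
```

Truncating at the first low-confidence token reflects the prefix nature of acceptance: tokens after a rejected position cannot be accepted, so verifying them is redundant computation.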
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
batch size scalability
draft verification mismatch
bonus token uncertainty
LLM inference acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

Speculative Decoding
Attention Tuning
Bonus-Guided Calibration
Flex Decoding
LLM Inference Acceleration