Alignment-Augmented Speculative Decoding with Alignment Sampling and Conditional Verification

📅 2025-05-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address low alignment between draft and target outputs and rigid verification—leading to constrained acceptance rates and accuracy in speculative decoding for large language models—this paper proposes a training-free inference acceleration method. Our approach introduces two key innovations: (1) an alignment-aware sampling strategy during the prefill phase, leveraging output distribution statistics to generate highly consistent draft sequences; and (2) a flexible, adaptive verification mechanism that dynamically adjusts probability thresholds to balance acceptance rate and correctness. Crucially, the method requires no model fine-tuning or additional training overhead. Evaluated across eight benchmark tasks, it achieves an average generation score improvement of 3.3 points, an average accepted draft length of 2.39 tokens, and a 2.23× end-to-end inference speedup—outperforming existing speculative decoding methods significantly.

📝 Abstract
Recent works have revealed the great potential of speculative decoding in accelerating the autoregressive generation process of large language models. The success of these methods relies on the alignment between draft candidates and the sampled outputs of the target model. Existing methods mainly achieve draft-target alignment with training-based approaches, e.g., EAGLE and Medusa, which involve considerable training costs. In this paper, we present a training-free alignment-augmented speculative decoding algorithm. We propose alignment sampling, which leverages the output distribution obtained in the prefilling phase to provide more aligned draft candidates. To further benefit from high-quality but non-aligned draft candidates, we also introduce a simple yet effective flexible verification strategy. Through an adaptive probability threshold, our approach can improve generation accuracy while further improving inference efficiency. Experiments on 8 datasets (covering question answering, summarization, and code completion tasks) show that our approach increases the average generation score by 3.3 points for the LLaMA3 model. Our method achieves a mean acceptance length of up to 2.39 and speeds up generation by 2.23×.
Problem

Research questions and friction points this paper is trying to address.

Enhancing draft-target alignment in speculative decoding without training
Improving generation accuracy and efficiency via flexible verification
Accelerating autoregressive generation in large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Training-free alignment-augmented speculative decoding algorithm
Alignment sampling using prefilling phase distribution
Flexible verification with adaptive probability threshold
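The flexible-verification idea above can be sketched in a few lines: instead of rigidly rejecting any draft token that differs from the target model's sample, accept a draft token whenever its target-model probability clears a threshold that adapts to how peaked the target distribution is. The paper does not publish its exact thresholding rule here, so the function name, the `alpha` parameter, and the threshold formula below are illustrative assumptions, not the authors' implementation.

```python
def flexible_verify(draft_tokens, target_probs, alpha=0.3):
    """Hypothetical sketch of adaptive-threshold draft verification.

    draft_tokens: list[int] of tokens proposed by the draft model
    target_probs: list[dict[int, float]] mapping token id -> target-model
        probability at each draft position
    alpha: assumed scaling factor for the adaptive threshold

    Returns the accepted prefix; on the first rejection, falls back to the
    target model's top token (a common speculative-decoding convention)
    and stops.
    """
    accepted = []
    for tok, dist in zip(draft_tokens, target_probs):
        # Adaptive threshold: a fraction of the target's top probability,
        # so a peaked (confident) distribution demands more of the draft
        # token than a flat one does.
        tau = alpha * max(dist.values())
        if dist.get(tok, 0.0) >= tau:
            accepted.append(tok)
        else:
            # Reject the draft token; emit the target model's own
            # greedy choice instead and stop verifying this draft.
            accepted.append(max(dist, key=dist.get))
            break
    return accepted
```

For example, with `alpha=0.3`, a draft token carrying 0.6 target probability against a 0.18 threshold is accepted, while a 0.05-probability token against a peaked distribution is rejected in favor of the target's top token. Greedy fallback is only one possible rejection policy; sampling from the (residual) target distribution is the usual alternative in speculative decoding.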