🤖 AI Summary
To address the significant slowdown in inference speed of large language models (LLMs) as parameter counts grow, this paper proposes Adaptive Speculative Decoding (AdaSD), a speculative decoding framework that requires neither training nor prior analysis of models and tasks. The core innovation is a dual adaptive threshold mechanism that dynamically adjusts both the candidate sequence length and the acceptance criterion in real time, based on token-level entropy and Jensen-Shannon distance, enabling unsupervised online decision-making. Unlike prior approaches, the method requires no hyperparameter tuning and is fully compatible with off-the-shelf LLMs. Evaluated on standard benchmarks, it achieves up to a 49% inference speedup over standard speculative decoding while keeping accuracy degradation under 2%, substantially enhancing both the practicality and generalizability of speculative decoding across diverse LLMs and deployment scenarios.
📝 Abstract
Large language models (LLMs) have achieved remarkable performance across a wide range of tasks, but their increasing parameter sizes significantly slow down inference. Speculative decoding mitigates this issue by leveraging a smaller draft model to predict candidate tokens, which are then verified by a larger target model. However, existing approaches often require additional training, extensive hyperparameter tuning, or prior analysis of models and tasks before deployment. In this paper, we propose Adaptive Speculative Decoding (AdaSD), a hyperparameter-free decoding scheme that dynamically adjusts generation length and acceptance criteria during inference. AdaSD introduces two adaptive thresholds: one to determine when to stop candidate token generation and another to decide token acceptance, both updated in real time based on token entropy and Jensen-Shannon distance. This approach eliminates the need for pre-analysis or fine-tuning and is compatible with off-the-shelf models. Experiments on benchmark datasets demonstrate that AdaSD achieves up to 49% speedup over standard speculative decoding while limiting accuracy degradation to under 2%, making it a practical solution for efficient and adaptive LLM inference.
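The abstract describes the dual-threshold mechanism only at a high level. The sketch below illustrates one plausible reading of it: a stop threshold on draft-model token entropy bounds candidate generation, and an acceptance threshold on the Jensen-Shannon distance between draft and target distributions governs verification, with both thresholds updated online. The exponential-moving-average updates, greedy drafting, and the `draft_model`/`target_model` callables (assumed to return next-token probability distributions) are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def entropy(p: torch.Tensor) -> float:
    """Shannon entropy (nats) of a next-token distribution."""
    p = p.clamp_min(1e-12)
    return float(-(p * p.log()).sum())

def js_distance(p: torch.Tensor, q: torch.Tensor) -> float:
    """Jensen-Shannon distance (sqrt of JS divergence) between two distributions."""
    p, q = p.clamp_min(1e-12), q.clamp_min(1e-12)
    m = 0.5 * (p + q)
    jsd = 0.5 * ((p * (p / m).log()).sum() + (q * (q / m).log()).sum())
    return float(jsd.clamp_min(0.0).sqrt())

def adasd_step(draft_model, target_model, prefix,
               tau_stop=1.0, tau_accept=0.3, beta=0.9, max_draft=16):
    """One speculative round with dual adaptive thresholds.

    tau_stop / tau_accept are running thresholds, updated here as
    exponential moving averages of observed entropies / JS distances
    (an assumed update rule; the paper's may differ).
    """
    # 1. Draft phase: keep proposing tokens while the draft model is
    #    confident, i.e. its entropy stays below the stop threshold.
    candidates, draft_probs, ctx = [], [], list(prefix)
    while len(candidates) < max_draft:
        p = draft_model(ctx)                # hypothetical: 1-D prob tensor
        h = entropy(p)
        if h > tau_stop and candidates:     # too uncertain: stop drafting
            break
        tok = int(p.argmax())               # greedy drafting (illustrative)
        candidates.append(tok)
        draft_probs.append(p)
        ctx.append(tok)
        tau_stop = beta * tau_stop + (1 - beta) * h   # adapt stop threshold

    # 2. Verify phase: the target model scores all candidates in one pass;
    #    accept a prefix while draft and target distributions stay close.
    target_probs = target_model(prefix, candidates)   # hypothetical API
    accepted = []
    for tok, p_d, p_t in zip(candidates, draft_probs, target_probs):
        d = js_distance(p_d, p_t)
        if d > tau_accept:                  # distributions diverge: reject
            break
        accepted.append(tok)
        tau_accept = beta * tau_accept + (1 - beta) * d  # adapt acceptance
    return accepted, tau_stop, tau_accept
```

Because both thresholds track recent statistics of the decode itself, this kind of loop needs no per-model tuning: an easy stretch of text raises draft confidence and lengthens candidate runs, while a hard stretch tightens both thresholds automatically.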