🤖 AI Summary
This work identifies a task-wise acceleration inequality in speculative decoding for large language model inference: acceleration degrades significantly on under-fit or data-underrepresented tasks, compromising fairness in inference efficiency. To address this, we first propose a quantifiable metric, "acceleration fairness", and design an optimization framework based on probabilistic inference scheduling and inter-model collaborative decoding. Our method dynamically adapts draft-model selection and verification strategies to mitigate biases induced by task heterogeneity. Empirical evaluation across multiple mainstream model pairs (e.g., Llama-3-8B/1B, Qwen2-7B/0.5B) demonstrates that our approach improves the acceleration fairness metric by 12% on average and reduces inter-task speedup variance by 37%. These results substantially enhance the robustness and generalizability of speculative decoding across diverse downstream tasks.
📝 Abstract
The practice of speculative decoding, whereby inference is probabilistically assisted by a smaller, cheaper "drafter" model, has become a standard technique for systematically reducing the decoding time of large language models. This paper analyzes speculative decoding through the lens of its potentially disparate speed-up rates across tasks. Crucially, the paper shows that the speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing on under-fit, and often underrepresented, tasks. To better understand this phenomenon, we derive an analysis that quantifies this observed "unfairness" and highlights the factors that cause such disparate speed-ups to emerge. Guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, showing on average a 12% improvement in our fairness metric.
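Why the speed-up varies by task can be sketched with the standard expected-tokens formula for speculative decoding: if the drafter proposes `gamma` tokens per step and each is accepted with probability `alpha`, the target model emits `(1 - alpha^(gamma+1)) / (1 - alpha)` tokens per verification pass in expectation. A minimal illustration (the acceptance rates below are hypothetical, chosen only to show the disparity; they are not results from the paper):

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens generated per target-model forward pass when the
    drafter proposes `gamma` tokens, each accepted i.i.d. with probability
    `alpha` (the standard speculative-decoding analysis)."""
    if alpha >= 1.0:
        return gamma + 1.0  # every draft token accepted
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

# Hypothetical per-task acceptance rates: a task the drafter fits well
# versus an underrepresented, under-fit one (illustrative numbers only).
well_fit = expected_tokens_per_step(alpha=0.8, gamma=4)
under_fit = expected_tokens_per_step(alpha=0.4, gamma=4)

# The well-fit task advances roughly twice as many tokens per verification
# pass, which is exactly the kind of inter-task disparity the paper studies.
print(f"well-fit: {well_fit:.2f} tokens/step, under-fit: {under_fit:.2f}")
```

Because the drafter's acceptance rate drops on tasks it has seen little data for, the effective speed-up drops with it, motivating the fairness metric the paper introduces.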