🤖 AI Summary
This work identifies a task-wise acceleration inequality in speculative decoding for large language model inference: acceleration degrades significantly on under-fit or data-underrepresented tasks, compromising fairness in inference efficiency. To address this, we first propose a quantifiable metric, "acceleration fairness", and design an optimization framework based on probabilistic inference scheduling and inter-model collaborative decoding. Our method dynamically adapts draft-model selection and verification strategies to mitigate biases induced by task heterogeneity. Empirical evaluation across multiple mainstream model pairs (e.g., Llama-3-8B/1B, Qwen2-7B/0.5B) demonstrates that our approach improves the acceleration fairness metric by 12% on average and reduces inter-task speedup variance by 37%. These results substantially enhance the robustness and generalizability of speculative decoding across diverse downstream tasks.
📝 Abstract
The practice of speculative decoding, whereby inference is probabilistically assisted by a smaller, cheaper "drafter" model, has become a standard technique for systematically reducing the decoding time of large language models. This paper analyzes speculative decoding through the lens of its potentially disparate speed-up rates across tasks. Crucially, the paper shows that the speed-up gained from speculative decoding is not uniformly distributed across tasks, consistently diminishing on under-fit, and often underrepresented, tasks. To better understand this phenomenon, we derive an analysis that quantifies this observed "unfairness" and highlights the factors that cause such disparate speed-ups to emerge. Guided by these insights, the paper proposes a mitigation strategy designed to reduce speed-up disparities and validates the approach across several model pairs, showing on average a 12% improvement in our fairness metric.
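Why the speed-up varies by task can be sketched with the standard expected-tokens formula for speculative decoding: if the drafter proposes `gamma` tokens per step and each is accepted with probability `alpha`, the target model emits `(1 - alpha^(gamma+1)) / (1 - alpha)` tokens per verification pass in expectation. A minimal illustration (the acceptance rates below are hypothetical, chosen only to show the disparity; they are not results from the paper):

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens generated per target-model forward pass when the
    drafter proposes `gamma` tokens, each accepted i.i.d. with probability
    `alpha` (the standard speculative-decoding analysis)."""
    if alpha >= 1.0:
        return gamma + 1.0  # every draft token accepted
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

# Hypothetical per-task acceptance rates: a task the drafter fits well
# versus an underrepresented, under-fit one (illustrative numbers only).
well_fit = expected_tokens_per_step(alpha=0.8, gamma=4)
under_fit = expected_tokens_per_step(alpha=0.4, gamma=4)

# The well-fit task advances roughly twice as many tokens per verification
# pass, which is exactly the kind of inter-task disparity the paper studies.
print(f"well-fit: {well_fit:.2f} tokens/step, under-fit: {under_fit:.2f}")
```

Because the drafter's acceptance rate drops on tasks it has seen little data for, the effective speed-up drops with it, motivating the fairness metric the paper introduces.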