🤖 AI Summary
This work addresses an efficiency bottleneck in speculative decoding: the distributional mismatch between general-purpose draft models and downstream tasks. The authors train lightweight HASS and EAGLE-2 draft models on task-specific datasets (MathInstruct and ShareGPT) and on mixed-data variants, then study how to combine the resulting specialized drafters at inference time through confidence-based routing and merged-tree verification. Measured by acceptance length, task-specialized draft models perform best on their matching benchmarks: MathInstruct-trained drafts on reasoning benchmarks and ShareGPT-trained drafts on MT-Bench. Mixed-data training improves robustness. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones.
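Since acceptance length is the metric the summary relies on, a minimal sketch of how it could be computed may help. The sketch assumes greedy verification of a single draft chain and counts one extra token per cycle for the token the target model supplies itself. The function names and the exact counting convention are illustrative assumptions, not the paper's definition.

```python
from typing import Sequence, Tuple


def accepted_prefix_length(draft: Sequence[int], target: Sequence[int]) -> int:
    """Count leading draft tokens that match the target model's greedy tokens."""
    n = 0
    for d, t in zip(draft, target):
        if d != t:
            break
        n += 1
    return n


def mean_acceptance_length(
    cycles: Sequence[Tuple[Sequence[int], Sequence[int]]]
) -> float:
    """Average tokens emitted per draft-verify cycle.

    Assumption: each cycle emits the accepted draft prefix plus one token
    produced by the target model (a correction or a bonus token).
    """
    if not cycles:
        raise ValueError("need at least one draft-verify cycle")
    total = sum(accepted_prefix_length(d, t) + 1 for d, t in cycles)
    return total / len(cycles)


# Hypothetical token ids: (draft proposal, target's greedy tokens at the same positions)
example_cycles = [
    ([1, 2, 3], [1, 2, 9]),  # first two draft tokens accepted
    ([4, 5], [4, 5]),        # both draft tokens accepted
    ([7], [8]),              # no draft token accepted
]
example_mean = mean_acceptance_length(example_cycles)
```

Under this convention, a drafter whose proposals match the target more often produces a higher mean, which is why a closer match between draft training data and the workload would show up directly in the metric.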
📝 Abstract
Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
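The two routing signals compared in the abstract can be illustrated with a short sketch. The abstract does not say how confidence is computed or at what granularity routing happens, so the code assumes that confidence is the top-1 probability of a drafter's next-token distribution and that the router picks the drafter with the highest value. The drafter names and probabilities are hypothetical.

```python
import math
from typing import Dict, List


def confidence(probs: List[float]) -> float:
    """Assumed confidence signal: probability of the drafter's top-1 token."""
    return max(probs)


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def route_by_confidence(drafter_probs: Dict[str, List[float]]) -> str:
    """Return the name of the drafter whose distribution is most confident."""
    return max(drafter_probs, key=lambda name: confidence(drafter_probs[name]))


# Hypothetical next-token distributions from two specialized drafters
candidates = {
    "math_drafter": [0.90, 0.05, 0.05],
    "chat_drafter": [0.40, 0.30, 0.30],
}
chosen = route_by_confidence(candidates)
```

In this toy case the two signals agree: the more confident drafter also has the lower entropy. The abstract's finding is that, in practice, confidence separates benchmarks more clearly than entropy does, even though rejected tokens tend to have higher entropy.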