TAPS: Task Aware Proposal Distributions for Speculative Sampling

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the efficiency bottleneck in speculative decoding caused by distributional mismatch between general-purpose draft models and downstream tasks. To overcome this limitation, the authors propose a task-adaptive, lightweight training and fusion strategy for draft models. Specifically, they fine-tune HASS and EAGLE-2 architectures on task-specific datasets such as MathInstruct and ShareGPT, and integrate a confidence-based routing mechanism with a merged-tree verification approach. This design significantly improves both the acceptance length of speculative tokens and task adaptability. Experimental results demonstrate that task-specialized draft models achieve superior performance on their respective benchmarks, while mixed-task training enhances robustness. Moreover, the proposed fusion strategy outperforms conventional weight averaging, delivering state-of-the-art overall results across multiple evaluation metrics.
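The confidence-based routing idea described above can be illustrated with a minimal sketch. This is a hypothetical interface, not the paper's implementation: each specialized drafter is assumed to return its proposed tokens together with its own per-token probabilities, and the router simply picks the proposal with the highest mean confidence.

```python
# Hypothetical sketch of confidence-based routing between specialized
# draft models (e.g. a math-tuned and a chat-tuned drafter). The drafter
# interface (tokens, confidences) is an assumption for illustration.

def route_draft(prompt_ids, drafters, k=4):
    """Return the draft proposal whose mean token confidence is highest.

    drafters: list of callables (prompt_ids, k) -> (token_ids, confidences),
    where confidences are the drafter's own probabilities for its tokens.
    """
    best_score, best_tokens = None, None
    for draft in drafters:
        tokens, confs = draft(prompt_ids, k)
        score = sum(confs) / len(confs)  # mean confidence as routing signal
        if best_score is None or score > best_score:
            best_score, best_tokens = score, tokens
    return best_tokens
```

In practice the routing signal would come from the drafter's softmax probabilities at each proposed position; the paper reports that this confidence signal routes more cleanly than entropy.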
📝 Abstract
Speculative decoding accelerates autoregressive generation by letting a lightweight draft model propose future tokens that a larger target model then verifies in parallel. In practice, however, draft models are usually trained on broad generic corpora, which leaves it unclear how much speculative decoding quality depends on the draft training distribution. We study this question with lightweight HASS and EAGLE-2 drafters trained on MathInstruct, ShareGPT, and mixed-data variants, evaluated on MT-Bench, GSM8K, MATH-500, and SVAMP. Measured by acceptance length, task-specific training yields clear specialization: MathInstruct-trained drafts are strongest on reasoning benchmarks, while ShareGPT-trained drafts are strongest on MT-Bench. Mixed-data training improves robustness, but larger mixtures do not dominate across decoding temperatures. We also study how to combine specialized drafters at inference time. Naive checkpoint averaging performs poorly, whereas confidence-based routing improves over single-domain drafts and merged-tree verification yields the highest acceptance length overall for both backbones. Finally, confidence is a more useful routing signal than entropy: rejected tokens tend to have higher entropy, but confidence produces much clearer benchmark-level routing decisions. These results show that speculative decoding quality depends not only on draft architecture, but also on the match between draft training data and downstream workload, and that specialized drafters are better combined at inference time than in weight space.
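The acceptance-length metric used throughout the abstract comes from the standard speculative-sampling verification rule: each drafted token is accepted with probability min(1, p_target/p_draft), and verification stops at the first rejection. A toy sketch, with probability tables standing in for the real draft and target models:

```python
import random

# Toy sketch of speculative verification and the "acceptance length"
# metric. p_draft / p_target are dicts mapping token -> probability,
# stand-ins (assumptions) for actual model distributions.

def accepted_prefix_length(draft_tokens, p_draft, p_target, rng=random.random):
    """Count how many drafted tokens the target accepts in a row.

    Each token t is accepted with probability min(1, p_target(t)/p_draft(t)),
    the standard speculative-sampling rule; the loop stops at the first
    rejection, so the return value is the acceptance length.
    """
    accepted = 0
    for t in draft_tokens:
        ratio = p_target.get(t, 0.0) / max(p_draft.get(t, 1e-9), 1e-9)
        if rng() < min(1.0, ratio):
            accepted += 1
        else:
            break
    return accepted
```

A draft model whose distribution matches the target's on the current workload yields ratios near 1 and long accepted prefixes, which is why the paper's task-aligned training raises acceptance length.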
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
draft model
training distribution
task specialization
inference-time combination
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
task-aware draft models
confidence-based routing
merged-tree verification
training distribution alignment
Mohamad Zbib
King Abdullah University of Science and Technology (KAUST), American University of Beirut (AUB)
Mohamad Bazzi
American University of Beirut (AUB)
Ammar Mohanna
American University of Beirut (AUB)
Hasan Abed Al Kader Hammoud
King Abdullah University of Science and Technology
Deep Learning, Computer Vision, Machine Learning
Bernard Ghanem
Professor, King Abdullah University of Science and Technology
computer vision, machine learning