Training Domain Draft Models for Speculative Decoding: Best Practices and Insights

📅 2025-03-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Domain shift causes a sharp decline in the acceptance rates of general draft models in speculative decoding. Method: This paper proposes a knowledge distillation-based training paradigm for domain-specific draft models. We systematically compare offline versus online and white-box versus black-box distillation strategies, finding that offline distillation substantially outperforms online distillation (by 11%–25%) and white-box distillation surpasses black-box distillation (by 2%–10%). We further demonstrate that synthetically generated data achieves 80%–93% of the performance attained with real historical queries. Contribution/Results: Experiments across the Function Calling, biomedical, and Chinese domains show that our approach significantly improves draft model prediction accuracy and enhances inference acceleration in domain-adapted speculative decoding. The work provides a reusable training framework and empirically grounded guidelines for domain adaptation of large language models.

📝 Abstract
Speculative decoding is an effective method for accelerating inference of large language models (LLMs) by employing a small draft model to predict the output of a target model. However, when adapting speculative decoding to domain-specific target models, the acceptance rate of the generic draft model drops significantly due to domain shift. In this work, we systematically investigate knowledge distillation techniques for training domain draft models to improve their speculation accuracy. We compare white-box and black-box distillation approaches and explore their effectiveness in various data accessibility scenarios, including historical user queries, curated domain data, and synthetically generated alignment data. Our experiments across Function Calling, Biology, and Chinese domains show that offline distillation consistently outperforms online distillation by 11% to 25%, white-box distillation surpasses black-box distillation by 2% to 10%, and data scaling trends hold across domains. Additionally, we find that synthetic data can effectively align draft models and achieve 80% to 93% of the performance of training on historical user queries. These findings provide practical guidelines for training domain-specific draft models to improve speculative decoding efficiency.
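The acceptance rate the abstract centers on comes from the standard speculative sampling rule: the target model accepts a draft token with probability min(1, p(x)/q(x)) and, on rejection, resamples from the normalized residual max(0, p − q). A minimal plain-Python sketch of that rule (illustrative only; `speculative_accept` is my name, not code from this paper):

```python
import random

def speculative_accept(p, q, token, rng=random):
    """Decide whether the target model keeps a draft token.

    p: target-model next-token distribution (list of probs)
    q: draft-model next-token distribution (list of probs)
    token: index of the token the draft model proposed
    Returns (token_index, accepted_flag).
    """
    # Accept with probability min(1, p[token] / q[token]).
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    r = rng.random() * z
    acc = 0.0
    for i, w in enumerate(residual):
        acc += w
        if r <= acc:
            return i, False
    return len(p) - 1, False  # numerical fallback
```

When the draft distribution matches the target exactly (no domain shift), every proposal is accepted; as q drifts from p, the acceptance rate falls, which is the degradation the paper measures.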
Problem

Research questions and friction points this paper is trying to address.

Improving speculative decoding efficiency for domain-specific models.
Comparing white-box and black-box distillation techniques.
Evaluating data sources like historical queries and synthetic data.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Knowledge distillation for domain draft models
White-box outperforms black-box distillation
Synthetic data aligns draft models effectively
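As a rough illustration of the white-box versus black-box contrast drawn above: white-box distillation trains the draft model against the target model's full next-token distribution (e.g. a KL term over logits), while black-box distillation only sees tokens sampled from the target. A stdlib-only sketch under those assumptions (function names are mine, not the paper's):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def white_box_kd_loss(teacher_logits, student_logits):
    """Forward KL(teacher || student) over the full vocabulary,
    usable only when the target model's logits are accessible."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def black_box_kd_loss(sampled_token, student_logits):
    """Cross-entropy on a single teacher-sampled token: the only signal
    available when the target model is an opaque API."""
    q = softmax(student_logits)
    return -math.log(q[sampled_token])
```

The white-box loss carries a full distribution's worth of supervision per position, while the black-box loss reduces it to one hard label, which is consistent with the paper's finding that white-box distillation yields a 2%–10% edge.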