🤖 AI Summary
This work addresses the performance degradation of speculative decoding caused by the mismatch between a target large language model and its draft model after domain-specific fine-tuning, a problem exacerbated by the prohibitive cost of retraining a dedicated draft model for each fine-tuned variant. To overcome this, the authors propose EDA, an efficient draft-model adaptation framework that decouples shared and private components, employs a target-model-guided data regeneration strategy, and incorporates a high-value sample selection mechanism. This approach enables effective adaptation at both the parameter and data levels without requiring full retraining. Experimental results demonstrate that EDA substantially restores speculative decoding performance across multiple fine-tuned models, achieving higher average acceptance lengths than baseline methods while significantly reducing training overhead.
📝 Abstract
Speculative decoding accelerates LLM inference but suffers from performance degradation when target models are fine-tuned for specific domains. A naive solution is to retrain draft models for every target model, which is costly and inefficient. To address this, we introduce Efficient Draft Adaptation (EDA), a parameter- and data-efficient framework for adapting draft models. EDA introduces three innovations: (1) a decoupled architecture with shared and private components that model the shared and target-specific output distributions separately, enabling parameter-efficient adaptation by updating only the lightweight private component; (2) a data regeneration strategy that uses the fine-tuned target model to regenerate training data, improving the alignment between training and speculative decoding and thereby increasing the average acceptance length; (3) a sample selection mechanism that prioritizes high-value data for efficient adaptation. Our experiments show that EDA effectively restores speculative decoding performance on fine-tuned models, achieving superior average acceptance lengths with significantly reduced training costs compared to full retraining. Code is available at https://github.com/Lyn-Lucy/Efficient-Draft-Adaptation.
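To make the "average acceptance length" metric concrete, below is a minimal, hypothetical sketch of greedy speculative decoding. The `draft_model` and `target_model` here are toy next-token functions (not the paper's models or code): the draft proposes a block of tokens, the target verifies them, and the acceptance length counts how many draft tokens the target agrees with before the first mismatch. A domain-mismatched draft yields short acceptance lengths, which is the degradation EDA aims to repair.

```python
def speculate(draft_model, target_model, prefix, k):
    """Draft proposes k tokens greedily; the target verifies them.

    Returns (emitted_tokens, acceptance_length). Under greedy decoding a
    draft token is accepted iff it equals the target's own next token; the
    first mismatch is replaced by the target's token and drafting stops.
    """
    # Drafting phase: the draft model extends the prefix by k tokens.
    proposal = list(prefix)
    drafted = []
    for _ in range(k):
        tok = draft_model(proposal)
        drafted.append(tok)
        proposal.append(tok)

    # Verification phase: the target checks each drafted token in turn.
    emitted = []
    verify_prefix = list(prefix)
    for tok in drafted:
        target_tok = target_model(verify_prefix)
        if tok == target_tok:
            emitted.append(tok)
            verify_prefix.append(tok)
        else:
            emitted.append(target_tok)  # correction token from the target
            break

    # Acceptance length counts only the draft tokens the target agreed with.
    accepted_len = sum(1 for e, d in zip(emitted, drafted) if e == d)
    return emitted, accepted_len

# Toy models: the target always emits last_token + 1; the mismatched draft
# only agrees when the last token is even, mimicking a draft model that has
# drifted from a fine-tuned target.
target = lambda seq: seq[-1] + 1
mismatched_draft = lambda seq: seq[-1] + 1 if seq[-1] % 2 == 0 else seq[-1]
aligned_draft = target  # a perfectly adapted draft

tokens, n = speculate(mismatched_draft, target, [0], k=4)   # n == 1
tokens2, n2 = speculate(aligned_draft, target, [0], k=4)    # n2 == 4
```

Running the sketch, the mismatched draft gets only one token accepted per block while the aligned draft gets all four, illustrating why adapting the draft to the fine-tuned target raises the average acceptance length and hence the decoding speedup.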