Efficiently Aligning Draft Models via Parameter- and Data-Efficient Adaptation

📅 2026-03-10
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the performance degradation of speculative decoding caused by the mismatch between a target large language model and its draft model after domain-specific fine-tuning, a problem exacerbated by the prohibitive cost of retraining a dedicated draft model for each fine-tuned variant. To overcome this, the authors propose EDA, an efficient draft model adaptation framework that decouples shared and private components, employs a target-model-guided data regeneration strategy, and incorporates a high-value sample selection mechanism. This approach enables effective adaptation at both the parameter and data levels without requiring full retraining. Experimental results demonstrate that EDA substantially restores speculative decoding performance across multiple fine-tuned models, achieving higher average accepted lengths than baseline methods while significantly reducing training overhead.

📝 Abstract
Speculative decoding accelerates LLM inference but suffers from performance degradation when target models are fine-tuned for specific domains. A naive solution is to retrain draft models for every target model, which is costly and inefficient. To address this, we introduce a parameter- and data-efficient framework named Efficient Draft Adaptation, abbreviated as EDA, for efficiently adapting draft models. EDA introduces three innovations: (1) a decoupled architecture that utilizes shared and private components to model the shared and target-specific output distributions separately, enabling parameter-efficient adaptation by updating only the lightweight private component; (2) a data regeneration strategy that utilizes the fine-tuned target model to regenerate training data, thereby improving the alignment between training and speculative decoding, leading to higher average acceptance length; (3) a sample selection mechanism that prioritizes high-value data for efficient adaptation. Our experiments show that EDA effectively restores speculative performance on fine-tuned models, achieving superior average acceptance lengths with significantly reduced training costs compared to full retraining. Code is available at https://github.com/Lyn-Lucy/Efficient-Draft-Adaptation.
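The abstract's first innovation, a decoupled draft architecture that freezes a shared component and updates only a lightweight private one, can be illustrated with a minimal sketch. This is not the paper's implementation; the class name, the low-rank form of the private component, and all dimensions are assumptions chosen to show how parameter-efficient adaptation keeps the trainable footprint small:

```python
import numpy as np

class DecoupledDraftHead:
    """Hypothetical sketch: a draft-model output head split into a frozen
    shared projection and a small trainable private correction, so only
    the private part is updated when adapting to a new fine-tuned target."""

    def __init__(self, hidden=16, vocab=32, rank=2, seed=0):
        rng = np.random.default_rng(seed)
        # Shared component: full projection, frozen across all targets.
        self.W_shared = rng.standard_normal((hidden, vocab)) * 0.02
        # Private component: low-rank correction, trainable per target.
        self.A = np.zeros((hidden, rank))
        self.B = np.zeros((rank, vocab))

    def logits(self, h):
        # Shared output distribution plus target-specific adjustment.
        return h @ self.W_shared + h @ self.A @ self.B

    def trainable_params(self):
        # Only the private component is updated during adaptation.
        return self.A.size + self.B.size

head = DecoupledDraftHead()
print(head.trainable_params())  # 16*2 + 2*32 = 96 trainable params
print(head.W_shared.size)       # 512 shared params stay frozen
```

Under these toy dimensions the private component holds fewer than a fifth of the head's parameters, which is the kind of asymmetry that makes per-target adaptation cheap relative to full retraining.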
Problem

Research questions and friction points this paper is trying to address.

speculative decoding
draft model alignment
fine-tuned LLMs
parameter efficiency
data efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

speculative decoding
parameter-efficient adaptation
data regeneration
decoupled architecture
sample selection
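The sample-selection idea listed above (prioritizing high-value data for adaptation) can be sketched with one plausible scoring rule. The paper does not specify its criterion here; ranking regenerated samples by draft-versus-target KL divergence is an assumption used purely for illustration, as are the function name and toy distributions:

```python
import numpy as np

def select_high_value(draft_probs, target_probs, k):
    """Hypothetical sketch: rank training samples by how much the draft
    and the fine-tuned target disagree (KL divergence), keeping the
    top-k most informative samples for adaptation."""
    eps = 1e-12
    kl = np.sum(target_probs * (np.log(target_probs + eps)
                                - np.log(draft_probs + eps)), axis=-1)
    # Larger divergence = draft is least aligned here = most valuable.
    return np.argsort(kl)[::-1][:k]

rng = np.random.default_rng(0)
target = rng.dirichlet(np.ones(8), size=5)  # per-sample target distributions
draft = rng.dirichlet(np.ones(8), size=5)   # per-sample draft distributions
print(select_high_value(draft, target, k=2))  # indices of 2 highest-KL samples
```

Any disagreement measure (e.g. expected acceptance rate under speculative sampling) could be substituted for KL without changing the top-k selection structure.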
Luxi Lin
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China
Zhihang Lin
Xiamen University & Shanghai Innovation Institute
Efficient Artificial Intelligence
Zhanpeng Zeng
University of Wisconsin-Madison
Transformer Efficiency
Yuhao Chen
University of Science and Technology of China
Large Language Model
Qingyu Zhang
Institute of Software, Chinese Academy of Sciences
Jixiang Luo
SenseTime
Data Compression, Video Coding, Signal Processing
Xuelong Li
Institute of Artificial Intelligence (TeleAI), China Telecom
Rongrong Ji
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, P.R. China