🤖 AI Summary
To address high inference latency in large language models (LLMs), low draft-token acceptance rates in speculative decoding, and poor cross-model-family generalization, this paper proposes Self-Distilled Sparse Drafters (SD$^2$), a framework combining *self-data distillation* with *fine-grained structured weight sparsity*. SD$^2$ constructs high-quality draft training data via lightweight self-data distillation (without requiring gradients from the target model) and applies fine-grained weight sparsity during fine-tuning, improving draft quality and computational efficiency while preserving alignment with the target model. On a Llama-3.1-70B target, SD$^2$ achieves a 1.59× higher mean accepted length (MAL) than layer-pruned draft models; compared to a dense draft model, it reduces multiply-accumulate operations (MACs) by 43.87% with only an 8.36% MAL degradation. The method also generalizes across model families and shows practical deployment viability.
📝 Abstract
Speculative decoding is a powerful technique for reducing the latency of Large Language Models (LLMs), offering a fault-tolerant framework that enables the use of highly compressed draft models. In this work, we introduce Self-Distilled Sparse Drafters (SD$^2$), a novel methodology that leverages self-data distillation and fine-grained weight sparsity to produce highly efficient and well-aligned draft models. SD$^2$ systematically enhances draft token acceptance rates while significantly reducing Multiply-Accumulate operations (MACs), even in the Universal Assisted Generation (UAG) setting, where draft and target models originate from different model families. On a Llama-3.1-70B target model, SD$^2$ provides a 1.59$\times$ higher Mean Accepted Length (MAL) compared to layer-pruned draft models and reduces MACs by over 43.87% with an 8.36% reduction in MAL compared to a dense draft model. Our results highlight the potential of sparsity-aware fine-tuning and compression strategies to improve LLM inference efficiency while maintaining alignment with target models.
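To make the Mean Accepted Length (MAL) metric concrete, here is a minimal sketch of greedy speculative-decoding acceptance: the target model accepts draft tokens up to the first mismatch, and MAL averages the accepted count per speculation round (plus the one token the target's own forward pass always contributes, under one common convention). The function names and the toy token sequences are illustrative assumptions, not the paper's implementation.

```python
def accepted_length(draft_tokens, target_tokens):
    """Number of draft tokens accepted under greedy verification:
    acceptance stops at the first token that disagrees with the target."""
    n = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        n += 1
    return n

def mean_accepted_length(rounds):
    """MAL over a list of (draft, target) token-sequence pairs.
    The +1 reflects the bonus token the target model emits each round
    (a common convention; definitions vary across papers)."""
    return sum(accepted_length(d, t) for d, t in rounds) / len(rounds) + 1

# Toy example: round 1 accepts 2 of 3 draft tokens, round 2 accepts all 2.
rounds = [([1, 2, 3], [1, 2, 4]), ([5, 6], [5, 6])]
print(mean_accepted_length(rounds))  # → 3.0
```

A higher MAL means fewer target-model forward passes per generated token, which is why improving draft alignment (rather than only shrinking the draft model) drives end-to-end speedup.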