SD$^2$: Self-Distilled Sparse Drafters

📅 2025-04-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address high inference latency in large language models (LLMs), low draft-token acceptance rates in speculative decoding, and poor cross-model-family generalization, this paper proposes Self-Distilled Sparse Drafters (SD$^2$), an integrated framework combining *self-data distillation* with *fine-grained structured weight sparsity*. SD$^2$ builds high-quality draft training data via lightweight self-data distillation—without requiring gradients from the target model—and applies fine-grained weight sparsity during fine-tuning to improve alignment with the target model while reducing draft-model compute. On a Llama-3.1-70B target model, SD$^2$ achieves a 1.59× higher Mean Accepted Length (MAL) than layer-pruned draft models; compared to a dense draft model, it reduces multiply-accumulate operations (MACs) by 43.87% with only an 8.36% MAL degradation. The method also generalizes across model families (the Universal Assisted Generation setting), supporting practical deployment.

📝 Abstract
Speculative decoding is a powerful technique for reducing the latency of Large Language Models (LLMs), offering a fault-tolerant framework that enables the use of highly compressed draft models. In this work, we introduce Self-Distilled Sparse Drafters (SD$^2$), a novel methodology that leverages self-data distillation and fine-grained weight sparsity to produce highly efficient and well-aligned draft models. SD$^2$ systematically enhances draft token acceptance rates while significantly reducing Multiply-Accumulate operations (MACs), even in the Universal Assisted Generation (UAG) setting, where draft and target models originate from different model families. On a Llama-3.1-70B target model, SD$^2$ provides a $\times$1.59 higher Mean Accepted Length (MAL) compared to layer-pruned draft models and reduces MACs by over 43.87% with an 8.36% reduction in MAL compared to a dense draft model. Our results highlight the potential of sparsity-aware fine-tuning and compression strategies to improve LLM inference efficiency while maintaining alignment with target models.
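The abstract's central metric, Mean Accepted Length (MAL), measures how many tokens the target model commits per verification pass. A minimal toy sketch of the draft-then-verify loop may help; the `draft_model` and `target_accepts` stand-ins below are hypothetical placeholders (real speculative decoding compares draft and target probability distributions, not a fixed acceptance rate):

```python
import random

random.seed(0)

def draft_model(prefix, k):
    # Toy stand-in for a small draft model: propose k candidate tokens.
    return [random.randint(0, 9) for _ in range(k)]

def target_accepts(prefix, token):
    # Toy stand-in for target-model verification: accept with fixed probability.
    # (In practice this depends on the draft/target probability ratio.)
    return random.random() < 0.7

def speculative_step(prefix, k=5):
    """One draft-then-verify step: keep the longest accepted prefix of the
    k drafted tokens, then emit one token from the target itself."""
    drafted = draft_model(prefix, k)
    accepted = []
    for tok in drafted:
        if target_accepts(prefix + accepted, tok):
            accepted.append(tok)
        else:
            break
    # The target always contributes one token (a correction or bonus token),
    # so every verification pass yields at least one committed token.
    accepted.append(random.randint(0, 9))
    return accepted

# MAL: average committed tokens per target forward pass.
steps = [speculative_step([]) for _ in range(1000)]
mal = sum(len(s) for s in steps) / len(steps)
print(f"MAL over 1000 steps: {mal:.2f}")
```

Because the target verifies all drafted tokens in a single forward pass, a higher MAL directly translates into fewer expensive target-model passes per generated token, which is why SD$^2$ optimizes draft alignment rather than draft quality in isolation.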
Problem

Research questions and friction points this paper is trying to address.

Reduces LLM latency via efficient speculative decoding
Enhances draft token acceptance with sparse models
Improves inference efficiency while maintaining model alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-distilled sparse drafters enhance efficiency
Fine-grained weight sparsity reduces MAC operations
Improves token acceptance with sparsity-aware fine-tuning
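The MAC reduction comes from fine-grained structured sparsity in the draft model's weights. As an illustration only (the paper's exact sparsity pattern and pruning criterion are not given here), a common fine-grained structured scheme is 2:4 sparsity, where two of every four consecutive weights are zeroed and hardware can skip the corresponding multiply-accumulates:

```python
import numpy as np

def prune_2_4(w):
    """Apply 2:4 fine-grained structured sparsity: in every group of 4
    consecutive weights, zero the 2 with smallest magnitude."""
    flat = w.copy().reshape(-1, 4)
    # Indices of the 2 smallest-magnitude entries in each group of 4.
    idx = np.argsort(np.abs(flat), axis=1)[:, :2]
    np.put_along_axis(flat, idx, 0.0, axis=1)
    return flat.reshape(w.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))          # toy weight matrix
w_sparse = prune_2_4(w)

dense_macs = w.size                    # one MAC per weight in a dense matvec
sparse_macs = np.count_nonzero(w_sparse)  # sparse hardware skips zeros
print(dense_macs, sparse_macs)         # 2:4 sparsity halves the MAC count
```

A fixed 2:4 pattern gives exactly 50% fewer weight MACs; the 43.87% figure reported above is an end-to-end measurement, which plausibly includes dense components (embeddings, attention) that the sparsity does not touch.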