Learning Harmonized Representations for Speculative Sampling

📅 2024-08-28
📈 Citations: 3
Influential: 0
🤖 AI Summary
To address the context inconsistency and the objective mismatch between training and decoding in LLM speculative sampling, this paper proposes HArmonized Speculative Sampling (HASS). HASS introduces two mechanisms: harmonized objective distillation, which aligns the draft model's training objective with the decoding-time goal, and harmonized context alignment, which exposes the draft model during training to the same kind of context it encounters at decoding time. Together these reduce the training/decoding mismatch without adding any inference overhead. On four LLaMA models, HASS achieves average wall-clock speedups of 2.81x to 4.05x across three datasets, surpassing the state-of-the-art EAGLE-2 by 8% to 20%.

📝 Abstract
Speculative sampling is a promising approach to accelerate the decoding stage for Large Language Models (LLMs). Recent advancements that leverage the target LLM's contextual information, such as hidden states and KV cache, have shown significant practical improvements. However, these approaches suffer from inconsistent context between training and decoding. We also observe another discrepancy between the training and decoding objectives in existing speculative sampling methods. In this work, we propose a solution named HArmonized Speculative Sampling (HASS) that learns harmonized representations to address these issues. HASS accelerates the decoding stage without adding inference overhead through harmonized objective distillation and harmonized context alignment. Experiments on four LLaMA models demonstrate that HASS achieves a 2.81x-4.05x wall-clock speedup ratio averaged across three datasets, surpassing EAGLE-2 by 8%-20%. The code is available at https://github.com/HArmonizedSS/HASS.
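For readers unfamiliar with the technique HASS accelerates, the standard speculative-sampling verification step works as follows: a cheap draft model proposes several tokens, and the target model accepts each drafted token with probability min(1, p/q), where p and q are the target and draft probabilities; on the first rejection, a corrective token is resampled from the residual distribution max(0, p − q). The sketch below is a generic, toy illustration of that accept/resample rule (not HASS itself, and not the paper's code); the dict-based "distributions" and function name are made up for the example.

```python
import random

def speculative_step(draft_probs, target_probs, drafted_tokens):
    """One verification pass over a batch of drafted tokens.

    draft_probs / target_probs: one dict per drafted position mapping
    token -> probability (toy stand-ins for real model outputs).
    Returns the accepted prefix, plus one resampled corrective token
    if a draft token was rejected.
    """
    accepted = []
    for i, tok in enumerate(drafted_tokens):
        p = target_probs[i].get(tok, 0.0)
        q = draft_probs[i].get(tok, 1e-9)
        if random.random() < min(1.0, p / q):
            accepted.append(tok)  # token is distributed as the target would sample it
        else:
            # Rejected: resample from the residual max(0, p - q), renormalized,
            # which preserves the target model's output distribution exactly.
            residual = {t: max(0.0, target_probs[i].get(t, 0.0) - draft_probs[i].get(t, 0.0))
                        for t in target_probs[i]}
            z = sum(residual.values())
            r, acc = random.random() * z, 0.0
            for t, w in residual.items():
                acc += w
                if r <= acc:
                    accepted.append(t)
                    break
            return accepted
    return accepted
```

The wall-clock speedups reported above come from accepting long drafted prefixes per target-model forward pass; HASS improves the draft model's acceptance rate by harmonizing its training context and objective with this decoding-time procedure.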
Problem

Research questions and friction points this paper is trying to address.

Inconsistent context between training and decoding in speculative sampling
Mismatch between training and decoding objectives
How to learn harmonized representations that speed up decoding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Harmonized objective distillation
Harmonized context alignment
Decoding acceleration with no added inference overhead