CORAL: Learning Consistent Representations across Multi-step Training with Lighter Speculative Drafter

📅 2025-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Speculative decoding suffers from training-inference mismatch, which impedes convergence of the lightweight draft model and degrades both accuracy and efficiency. To address this, the paper proposes CORAL, a framework with two core innovations: (1) Cross-Step Representation Alignment, a hidden-state consistency constraint that mitigates representation inconsistency under multi-step training; and (2) an LM-head weight-grouping mechanism that sparsely activates a subset of LM-head parameters at inference time, relieving the draft model's main latency bottleneck. Evaluated on three mainstream LLM families and three benchmark datasets, CORAL achieves 2.50x-4.07x inference speedup, outperforming state-of-the-art approaches including EAGLE-2 and HASS.

📝 Abstract
Speculative decoding is a powerful technique that accelerates Large Language Model (LLM) inference by leveraging a lightweight speculative draft model. However, existing designs suffer in performance due to misalignment between training and inference. Recent methods have tried to solve this issue by adopting a multi-step training strategy, but the complex inputs of different training steps make it harder for the draft model to converge. To address this, we propose CORAL, a novel framework that improves both accuracy and efficiency in speculative drafting. CORAL introduces Cross-Step Representation Alignment, a method that enhances consistency across multiple training steps, significantly improving speculative drafting performance. Additionally, we identify the LM head as a major bottleneck in the inference speed of the draft model. We introduce a weight-grouping mechanism that selectively activates a subset of LM head parameters during inference, substantially reducing the latency of the draft model. We evaluate CORAL on three LLM families and three benchmark datasets, achieving speedup ratios of 2.50x-4.07x, outperforming state-of-the-art methods such as EAGLE-2 and HASS. Our results demonstrate that CORAL effectively mitigates training-inference misalignment and delivers significant speedup for modern LLMs with large vocabularies.
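The abstract describes encouraging the draft model's hidden states to stay consistent across multiple training steps. A minimal numpy sketch of such a consistency penalty, averaged over steps, is shown below; the function name, tensor shapes, and use of a plain MSE are illustrative assumptions, not CORAL's actual formulation:

```python
import numpy as np

def cross_step_alignment_loss(draft_states, target_states):
    """Average squared distance between draft and reference hidden states,
    one pair of (batch, hidden) arrays per training step.

    Hypothetical sketch: the paper's alignment objective may weight steps
    differently or use a different distance.
    """
    assert len(draft_states) == len(target_states)
    per_step = [np.mean((d - t) ** 2)
                for d, t in zip(draft_states, target_states)]
    return float(np.mean(per_step))
```

In a real training loop this term would be added to the usual drafting loss, so that representations produced at later speculation steps stay close to those the draft model sees at inference time.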
Problem

Research questions and friction points this paper is trying to address.

Addresses misalignment in speculative decoding training
Enhances consistency across multi-step training processes
Reduces inference latency in large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Step Representation Alignment enhances consistency
Weight-grouping mechanism reduces LM head latency
Achieves 2.50x-4.07x speedup in LLM inference
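The weight-grouping idea above amounts to computing logits only for vocabulary rows whose group is activated, instead of multiplying the hidden state against the full LM-head matrix. A small numpy sketch, assuming a precomputed group label per vocabulary entry (the grouping criterion and interface are hypothetical, not taken from the paper):

```python
import numpy as np

def grouped_lm_head(hidden, weight, group_ids, active_groups):
    """Sparse LM-head forward pass.

    hidden: (d,) draft hidden state; weight: (V, d) LM-head matrix;
    group_ids: (V,) int group label per vocab row;
    active_groups: group labels to activate this step.
    Inactive rows get -inf so they can never be sampled.
    """
    mask = np.isin(group_ids, list(active_groups))
    logits = np.full(weight.shape[0], -np.inf)
    logits[mask] = weight[mask] @ hidden  # matmul over active rows only
    return logits
```

For a large vocabulary, restricting the matmul to a few active groups is what reduces the draft model's per-step latency.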
Yepeng Weng
Researcher, Lenovo Research
Large Language Models · Computer Vision
Dianwen Mei
Lenovo Research
Huishi Qiu
Lenovo Research
Xujie Chen
Lenovo Research
Li Liu
Lenovo Research
Jiang Tian
Principal Researcher, AI Lab, Lenovo Research
Medical Imaging Processing · Deep Learning · Computer Vision · Computer Graphics · Robotics
Zhongchao Shi
Lenovo Research