🤖 AI Summary
To address the high inference latency caused by autoregressive decoding in large language models (LLMs), this work makes two observations: autoregression is simpler at the second-to-top-layer feature level than at the token level, yet the sampling step introduces inherent uncertainty that limits feature-level prediction. Building on these insights, it proposes a feature-level speculative sampling framework: a lightweight draft head predicts the next second-to-top-layer feature, conditioned on the token sequence advanced by one time step, which resolves the uncertainty with minimal overhead while preserving the output distribution. The approach is model-agnostic and applies to mainstream LLMs including LLaMA2-Chat, Vicuna, and Mixtral. On LLaMA2-Chat 70B, EAGLE achieves a 2.7×–3.5× latency speedup and doubles throughput while leaving the distribution of the generated text unchanged.
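A minimal PyTorch sketch of the drafting idea follows. It is illustrative only, not the paper's exact architecture; the names (`FeatureDraftHead`, `fuse`, `block`) are hypothetical. A lightweight head fuses each second-to-top-layer feature with the embedding of the token sampled one step ahead and extrapolates the next feature, which the base model's frozen LM head can then turn into draft token probabilities.

```python
import torch
import torch.nn as nn


class FeatureDraftHead(nn.Module):
    """Illustrative feature-level draft head (a sketch, not EAGLE's exact design).

    It consumes second-to-top-layer features f_1..f_t together with the
    embeddings of the token sequence advanced by one step (t_2..t_{t+1})
    and predicts the next features f_2..f_{t+1}.
    """

    def __init__(self, hidden_size: int, nhead: int = 8):
        super().__init__()
        # Fuse (feature, advanced-token embedding) back to the model width.
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        # A single lightweight transformer layer stands in for the draft decoder.
        self.block = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=nhead, batch_first=True
        )

    def forward(self, features: torch.Tensor, advanced_tok_emb: torch.Tensor) -> torch.Tensor:
        # features:         (batch, seq, hidden) second-to-top-layer states
        # advanced_tok_emb: (batch, seq, hidden) embeddings of tokens shifted one step ahead
        x = self.fuse(torch.cat([features, advanced_tok_emb], dim=-1))
        seq_len = x.size(1)
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf"), device=x.device), diagonal=1
        )
        return self.block(x, src_mask=causal_mask)  # predicted next features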
📝 Abstract
Autoregressive decoding makes the inference of Large Language Models (LLMs) time-consuming. In this paper, we reconsider speculative sampling and derive two key observations. Firstly, autoregression at the feature (second-to-top-layer) level is more straightforward than at the token level. Secondly, the inherent uncertainty in feature (second-to-top-layer) level autoregression constrains its performance. Based on these insights, we introduce EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a simple yet highly efficient speculative sampling framework. By incorporating a token sequence advanced by one time step, EAGLE effectively resolves the uncertainty, enabling precise second-to-top-layer feature prediction with minimal overhead. We conducted comprehensive evaluations of EAGLE, including all models from the Vicuna and LLaMA2-Chat series, the MoE model Mixtral 8x7B Instruct, and tasks in dialogue, code generation, mathematical reasoning, and instruction following. For LLaMA2-Chat 70B, EAGLE achieved a latency speedup ratio of 2.7x-3.5x, doubled throughput, while maintaining the distribution of the generated text.
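The claim that the distribution of the generated text is maintained rests on the standard speculative sampling accept/reject rule used during verification, not on the draft head itself. The sketch below (function name illustrative) shows that rule for a single drafted token: accept with probability min(1, p_target/p_draft), otherwise resample from the normalized residual, which together reproduce the base model's distribution exactly.

```python
import torch


def speculative_accept(draft_token: int,
                       p_target: torch.Tensor,
                       p_draft: torch.Tensor) -> int:
    """One verification step of standard speculative sampling.

    p_target: base-model token distribution, shape (vocab,)
    p_draft:  draft-model token distribution, shape (vocab,)
    Returns a token whose marginal distribution equals p_target.
    """
    # Accept the drafted token with probability min(1, p_target / p_draft).
    accept_prob = torch.clamp(p_target[draft_token] / p_draft[draft_token], max=1.0)
    if torch.rand(()) < accept_prob:
        return draft_token
    # On rejection, resample from the normalized residual distribution.
    residual = torch.clamp(p_target - p_draft, min=0.0)
    residual = residual / residual.sum()
    return int(torch.multinomial(residual, num_samples=1))
```

Because EAGLE only changes how draft tokens are proposed (feature-level extrapolation instead of a smaller token-level draft model), this verification step is what guarantees the output distribution is unchanged.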