🤖 AI Summary
Existing speculative decoding methods suffer from sharply declining acceptance rates and diminished inference acceleration under long-context inputs. This paper introduces OWL, an efficient speculative decoding framework tailored for long-text generation. Its core contributions are: (1) a lightweight LSTM-based draft model that operates solely on the final token’s hidden state, eliminating dependence on fixed-length context windows; (2) a verifier augmented with a dedicated [SPEC] token to enhance long-range semantic modeling; and (3) a hybrid verification strategy integrating tree-structured and sequential decoding to improve both acceptance length and robustness. On LongSpecBench, a newly constructed long-context benchmark, OWL increases average acceptance length by approximately 5× over EAGLE3 and achieves up to a 1.23× end-to-end speedup. The code and dataset are publicly released.
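To make contribution (1) concrete, here is a minimal, purely illustrative sketch of an LSTM drafter that proposes draft tokens conditioned only on the verifier's last-token hidden state, with no fixed-length context window. The dimensions, random weights, and greedy decoding are hypothetical choices for demonstration, not OWL's actual architecture:

```python
import numpy as np

# Toy stand-in for OWL's lightweight LSTM drafter: it receives only the
# final token's hidden state from the verifier and rolls forward
# autoregressively. All sizes and weights here are arbitrary.
rng = np.random.default_rng(0)
HIDDEN, VOCAB = 8, 16

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ToyLSTMDrafter:
    def __init__(self):
        # One fused weight matrix for the input/forget/cell/output gates.
        self.W = rng.standard_normal((4 * HIDDEN, 2 * HIDDEN)) * 0.1
        self.b = np.zeros(4 * HIDDEN)
        self.head = rng.standard_normal((VOCAB, HIDDEN)) * 0.1   # vocab projection
        self.embed = rng.standard_normal((VOCAB, HIDDEN)) * 0.1  # token embeddings

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

    def draft(self, last_hidden, k):
        """Propose k draft tokens from the verifier's last-token hidden state."""
        h, c = last_hidden, np.zeros(HIDDEN)
        x = np.zeros(HIDDEN)  # no token embedding at the first step
        tokens = []
        for _ in range(k):
            h, c = self.step(x, h, c)
            tok = int(np.argmax(self.head @ h))  # greedy pick
            tokens.append(tok)
            x = self.embed[tok]  # feed the drafted token back in
        return tokens

drafter = ToyLSTMDrafter()
draft = drafter.draft(rng.standard_normal(HIDDEN), k=4)
print(draft)  # four proposed token ids
```

Because the drafter consumes a single hidden-state vector rather than the full context, its cost is independent of prompt length, which is what lets this design generalize to long inputs.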
📝 Abstract
Speculative decoding promises faster inference for large language models (LLMs), yet existing methods fail to generalize to real-world settings. Benchmarks typically assume short contexts (e.g., 2K tokens), whereas practical workloads involve long contexts. We find that current approaches degrade severely with long contexts; for instance, EAGLE3 even slows generation to 0.81x of the baseline speed. We address these limitations by releasing a new long-context benchmark (LongSpecBench) and introducing a novel model (OWL). OWL achieves about 5x higher acceptance length than EAGLE3 on long-context inputs through three innovations: (1) an LSTM-based drafter conditioned only on the last-token state, allowing it to generalize across context lengths, (2) a special token [SPEC] in the verifier that produces a richer representation for the drafter, and (3) a hybrid algorithm combining tree and non-tree decoding methods. We release all code and datasets to advance future research.
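The acceptance-length metric both sections rely on can be illustrated with the standard greedy speculative-verification loop (this is generic speculative decoding, not OWL's hybrid tree/sequential algorithm): the verifier checks the drafted tokens, accepts the longest prefix it agrees with, and contributes one token of its own. `verifier_argmax` below is a hypothetical stand-in for a real LLM forward pass:

```python
def verify(prefix, draft_tokens, verifier_argmax):
    """Return accepted tokens: the matched draft prefix plus one verifier token."""
    accepted = []
    ctx = list(prefix)
    for t in draft_tokens:
        expected = verifier_argmax(ctx)   # verifier's greedy choice at this step
        if expected != t:                 # first mismatch: stop, keep the correction
            accepted.append(expected)
            return accepted
        accepted.append(t)                # draft token agrees -> accept it
        ctx.append(t)
    # Every draft token was accepted; the verifier adds one bonus token.
    accepted.append(verifier_argmax(ctx))
    return accepted

# Toy verifier that always continues the sequence with last + 1.
toy = lambda ctx: (ctx[-1] + 1) if ctx else 0
print(verify([10], [11, 12, 99], toy))  # -> [11, 12, 13]
```

Each call to `verify` costs one verifier pass regardless of how many draft tokens are accepted, so a roughly 5x longer average acceptance length translates directly into fewer expensive verifier invocations per generated token.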