🤖 AI Summary
Existing speculative decoding methods suffer from sharply declining acceptance rates and diminished inference acceleration under long-context inputs. This paper introduces OWL, an efficient speculative decoding framework tailored for long-text generation. Its core contributions are: (1) a lightweight LSTM-based draft model that operates solely on the final token’s hidden state, eliminating dependence on fixed-length context windows; (2) a verifier augmented with a dedicated [SPEC] token to enhance long-range semantic modeling; and (3) a hybrid verification strategy integrating tree-structured and sequential decoding to improve both acceptance length and robustness. On LongSpecBench, a newly constructed long-context benchmark, OWL increases average acceptance length by approximately 5× over EAGLE3 and achieves up to a 1.23× end-to-end speedup. The code and dataset are publicly released.
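To make contribution (1) concrete, here is a minimal, purely illustrative sketch of an LSTM drafter that proposes draft tokens conditioned only on the verifier's last-token hidden state, with no fixed-length context window. The dimensions, random weights, and greedy decoding are hypothetical choices for demonstration, not OWL's actual architecture:

```python
import numpy as np

# Toy stand-in for OWL's lightweight LSTM drafter: it receives only the
# final token's hidden state from the verifier and rolls forward
# autoregressively. All sizes and weights here are arbitrary.
rng = np.random.default_rng(0)
HIDDEN, VOCAB = 8, 16

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class ToyLSTMDrafter:
    def __init__(self):
        # One fused weight matrix for the input/forget/cell/output gates.
        self.W = rng.standard_normal((4 * HIDDEN, 2 * HIDDEN)) * 0.1
        self.b = np.zeros(4 * HIDDEN)
        self.head = rng.standard_normal((VOCAB, HIDDEN)) * 0.1   # vocab projection
        self.embed = rng.standard_normal((VOCAB, HIDDEN)) * 0.1  # token embeddings

    def step(self, x, h, c):
        z = self.W @ np.concatenate([x, h]) + self.b
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

    def draft(self, last_hidden, k):
        """Propose k draft tokens from the verifier's last-token hidden state."""
        h, c = last_hidden, np.zeros(HIDDEN)
        x = np.zeros(HIDDEN)  # no token embedding at the first step
        tokens = []
        for _ in range(k):
            h, c = self.step(x, h, c)
            tok = int(np.argmax(self.head @ h))  # greedy pick
            tokens.append(tok)
            x = self.embed[tok]  # feed the drafted token back in
        return tokens

drafter = ToyLSTMDrafter()
draft = drafter.draft(rng.standard_normal(HIDDEN), k=4)
print(draft)  # four proposed token ids
```

Because the drafter consumes a single hidden-state vector rather than the full context, its cost is independent of prompt length, which is what lets this design generalize to long inputs.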
📝 Abstract
Speculative decoding promises faster inference for large language models (LLMs), yet existing methods fail to generalize to real-world settings. Benchmarks typically assume short contexts (e.g., 2K tokens), whereas practical workloads involve long contexts. We find that current approaches degrade severely with long contexts; for instance, EAGLE3 even slows generation to 0.81x of the baseline speed. We address these limitations by releasing a new long-context benchmark (LongSpecBench) and introducing a novel model (OWL). OWL achieves about 5x higher acceptance length than EAGLE3 on long-context inputs through three innovations: (1) an LSTM-based drafter conditioned only on the last-token state, allowing it to generalize across context lengths, (2) a special token [SPEC] in the verifier that produces a richer representation for the drafter, and (3) a hybrid algorithm combining tree and non-tree decoding methods. We release all code and datasets to advance future research.
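The acceptance-length metric both sections rely on can be illustrated with the standard greedy speculative-verification loop (this is generic speculative decoding, not OWL's hybrid tree/sequential algorithm): the verifier checks the drafted tokens, accepts the longest prefix it agrees with, and contributes one token of its own. `verifier_argmax` below is a hypothetical stand-in for a real LLM forward pass:

```python
def verify(prefix, draft_tokens, verifier_argmax):
    """Return accepted tokens: the matched draft prefix plus one verifier token."""
    accepted = []
    ctx = list(prefix)
    for t in draft_tokens:
        expected = verifier_argmax(ctx)   # verifier's greedy choice at this step
        if expected != t:                 # first mismatch: stop, keep the correction
            accepted.append(expected)
            return accepted
        accepted.append(t)                # draft token agrees -> accept it
        ctx.append(t)
    # Every draft token was accepted; the verifier adds one bonus token.
    accepted.append(verifier_argmax(ctx))
    return accepted

# Toy verifier that always continues the sequence with last + 1.
toy = lambda ctx: (ctx[-1] + 1) if ctx else 0
print(verify([10], [11, 12, 99], toy))  # -> [11, 12, 13]
```

Each call to `verify` costs one verifier pass regardless of how many draft tokens are accepted, so a roughly 5x longer average acceptance length translates directly into fewer expensive verifier invocations per generated token.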