🤖 AI Summary
This work addresses the training complexity of parallel draft generation in long-context scenarios, which grows quadratically with the product of sequence length and the number of parallel prediction positions, hindering scalability. To overcome this limitation, the authors extend EAGLE from autoregressive to parallel multi-token prediction by introducing a learnable shared hidden state mechanism. They further integrate precomputed attention masks and sequence chunking techniques, enabling gradient accumulation within a single sequence for the first time. This approach significantly improves training efficiency. Experiments on GPT-OSS 120B/20B and Qwen3-Coder 30B models demonstrate inference speedups of 1.10–1.36× compared to the autoregressive EAGLE-3 baseline.
📝 Abstract
Reasoning LLMs produce longer outputs, requiring speculative decoding drafters trained on extended sequences. Parallel drafting (predicting multiple tokens per forward pass) offers latency benefits over sequential generation, but training complexity scales quadratically with the product of sequence length and parallel positions, rendering long-context training impractical. We present P(arallel)-EAGLE, which transforms EAGLE from autoregressive to parallel multi-token prediction via a learnable shared hidden state. To scale training to long contexts, we develop a framework featuring attention mask pre-computation and sequence partitioning techniques, enabling gradient accumulation within individual sequences for parallel-prediction training. We implement P-EAGLE in vLLM and demonstrate speedups of 1.10–1.36× over autoregressive EAGLE-3 across GPT-OSS 120B, 20B, and Qwen3-Coder 30B.
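To make the sequence-partitioning idea concrete, here is a minimal, hypothetical sketch of the two ingredients the abstract names: splitting one long sequence into chunks (so each chunk's forward/backward pass can accumulate gradients before a single optimizer step) and pre-computing a boolean attention mask per chunk. The helper names, the exact masking rule (each of the `k` parallel draft slots attends only to the verified prefix), and the chunking scheme are illustrative assumptions, not the paper's actual implementation.

```python
def chunked_positions(seq_len, chunk_len):
    """Partition [0, seq_len) into contiguous chunks (hypothetical helper).

    Each chunk would be one gradient-accumulation step, so peak activation
    memory scales with chunk_len rather than the full sequence length.
    """
    return [(s, min(s + chunk_len, seq_len)) for s in range(0, seq_len, chunk_len)]


def parallel_draft_mask(chunk_start, chunk_end, k):
    """Pre-computed boolean attention mask for one chunk of a parallel drafter.

    Each query position i in the chunk carries k parallel draft slots. In this
    sketch every slot attends only to key positions [0, i] (the verified
    prefix), not to sibling draft slots -- one plausible scheme; the paper's
    exact mask may differ. Shape: (chunk positions * k) rows by chunk_end
    key columns.
    """
    rows = []
    for i in range(chunk_start, chunk_end):
        for _slot in range(k):
            # All k slots at position i share the same visibility: the prefix.
            rows.append([col <= i for col in range(chunk_end)])
    return rows
```

A training loop would iterate over `chunked_positions(...)`, look up the pre-computed mask for each chunk, run forward/backward, and call the optimizer once per full sequence.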