🤖 AI Summary
This work addresses the training complexity of parallel draft generation in long-context scenarios, which grows quadratically with the product of sequence length and the number of parallel prediction positions, hindering scalability. To overcome this limitation, the authors extend EAGLE from autoregressive to parallel multi-token prediction by introducing a learnable shared hidden state mechanism. They further integrate precomputed attention masks and sequence chunking techniques, enabling gradient accumulation within a single sequence for the first time. This approach significantly improves training efficiency. Experiments on GPT-OSS 120B/20B and Qwen3-Coder 30B models demonstrate inference speedups of 1.10–1.36× compared to the autoregressive EAGLE-3 baseline.
📝 Abstract
Reasoning LLMs produce longer outputs, requiring speculative decoding drafters trained on extended sequences. Parallel drafting (predicting multiple tokens per forward pass) offers latency benefits over sequential generation, but training complexity scales quadratically with the product of sequence length and parallel positions, rendering long-context training impractical. We present P(arallel)-EAGLE, which transforms EAGLE from autoregressive to parallel multi-token prediction via a learnable shared hidden state. To scale training to long contexts, we develop a framework featuring attention mask pre-computation and sequence partitioning techniques, enabling gradient accumulation within individual sequences for parallel-prediction training. We implement P-EAGLE in vLLM and demonstrate speedups of 1.10–1.36× over autoregressive EAGLE-3 across GPT-OSS 120B, 20B, and Qwen3-Coder 30B.
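To make the sequence-partitioning idea concrete, here is a minimal, hypothetical sketch of the two ingredients the abstract names: splitting one long sequence into chunks (so each chunk's forward/backward pass can accumulate gradients before a single optimizer step) and pre-computing a boolean attention mask per chunk. The helper names, the exact masking rule (each of the `k` parallel draft slots attends only to the verified prefix), and the chunking scheme are illustrative assumptions, not the paper's actual implementation.

```python
def chunked_positions(seq_len, chunk_len):
    """Partition [0, seq_len) into contiguous chunks (hypothetical helper).

    Each chunk would be one gradient-accumulation step, so peak activation
    memory scales with chunk_len rather than the full sequence length.
    """
    return [(s, min(s + chunk_len, seq_len)) for s in range(0, seq_len, chunk_len)]


def parallel_draft_mask(chunk_start, chunk_end, k):
    """Pre-computed boolean attention mask for one chunk of a parallel drafter.

    Each query position i in the chunk carries k parallel draft slots. In this
    sketch every slot attends only to key positions [0, i] (the verified
    prefix), not to sibling draft slots -- one plausible scheme; the paper's
    exact mask may differ. Shape: (chunk positions * k) rows by chunk_end
    key columns.
    """
    rows = []
    for i in range(chunk_start, chunk_end):
        for _slot in range(k):
            # All k slots at position i share the same visibility: the prefix.
            rows.append([col <= i for col in range(chunk_end)])
    return rows
```

A training loop would iterate over `chunked_positions(...)`, look up the pre-computed mask for each chunk, run forward/backward, and call the optimizer once per full sequence.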