P-EAGLE: Parallel-Drafting EAGLE with Scalable Training

📅 2026-02-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the quadratic growth in training complexity of parallel draft generation under long-context scenarios, which scales with the product of sequence length and the number of parallel prediction positions, hindering scalability. To overcome this limitation, the authors extend EAGLE from autoregressive to parallel multi-token prediction by introducing a learnable shared hidden state mechanism. They further integrate precomputed attention masks and sequence chunking techniques, enabling gradient accumulation within a single sequence for the first time. This approach significantly improves training efficiency. Experiments on GPT-OSS 120B/20B and Qwen3-Coder 30B models demonstrate inference speedups of 1.10–1.36× compared to the autoregressive EAGLE-3 baseline.
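The chunked-training idea in the summary can be illustrated with a toy example (a hypothetical sketch, not the paper's code): when the training loss is a sum of per-position terms, gradients accumulated over chunks of one sequence equal the gradient of a single full-sequence pass, so peak memory scales with the chunk size rather than the full context length.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, dim, chunk = 12, 4, 3
W = rng.normal(size=(dim,))          # toy "drafter" parameters
X = rng.normal(size=(seq_len, dim))  # per-position hidden states of one long sequence
y = rng.normal(size=(seq_len,))      # per-position targets

def grad(Xs, ys, W):
    # gradient of 0.5 * sum((Xs @ W - ys)**2) with respect to W
    return Xs.T @ (Xs @ W - ys)

# full-sequence gradient (one long backward pass)
g_full = grad(X, y, W)

# gradient accumulation over chunks of the SAME sequence:
# each chunk fits in memory, and the summed gradients match the full pass
g_acc = np.zeros_like(W)
for s in range(0, seq_len, chunk):
    g_acc += grad(X[s:s+chunk], y[s:s+chunk], W)

assert np.allclose(g_full, g_acc)
```

In a real transformer the per-position losses are coupled through attention across chunk boundaries, which is presumably why the paper pairs chunking with precomputed attention masks; this toy regression only shows why summing chunk gradients is valid for a position-wise loss.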

📝 Abstract
Reasoning LLMs produce longer outputs, requiring speculative decoding drafters trained on extended sequences. Parallel drafting (predicting multiple tokens per forward pass) offers latency benefits over sequential generation, but training complexity scales quadratically with the product of sequence length and parallel positions, rendering long-context training impractical. We present P(arallel)-EAGLE, which transforms EAGLE from autoregressive to parallel multi-token prediction via a learnable shared hidden state. To scale training to long contexts, we develop a framework featuring attention mask pre-computation and sequence partitioning techniques, enabling gradient accumulation within individual sequences for parallel-prediction training. We implement P-EAGLE in vLLM and demonstrate speedups of 1.10–1.36× over autoregressive EAGLE-3 across GPT-OSS 120B, 20B, and Qwen3-Coder 30B.
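One plausible shape for the precomputed attention mask behind parallel drafting (an illustration under assumed layout, not the paper's exact design): the verified prefix stays causal, a shared-state slot is visible to all later positions, and the k parallel draft slots each see the prefix and the shared slot but are masked from one another, so all k tokens can be predicted in a single forward pass.

```python
import numpy as np

def parallel_draft_mask(prefix_len: int, k: int) -> np.ndarray:
    """Boolean attention mask (True = may attend).

    Hypothetical layout: [prefix tokens | shared-state slot | k draft slots].
    The prefix is causal; the shared slot and every draft slot see the full
    prefix; the k draft slots are mutually masked so they are conditionally
    independent given the prefix and the shared hidden state.
    """
    n = prefix_len + 1 + k
    mask = np.tril(np.ones((n, n), dtype=bool))   # start from a causal mask
    draft = np.arange(prefix_len + 1, n)          # indices of the k draft slots
    for i in draft:
        mask[i, draft] = False                    # block draft-to-draft attention
        mask[i, i] = True                         # keep self-attention
    return mask

m = parallel_draft_mask(prefix_len=4, k=3)
assert not m[6, 5]        # draft slots do not see each other
assert m[6, 4]            # but they see the shared-state slot
assert m[6, :4].all()     # and the full verified prefix
```

Because such a mask depends only on (prefix length, k), it can be built once and reused across training steps, which is consistent with the abstract's "attention mask pre-computation", though the actual mask structure used in P-EAGLE is not specified here.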
Problem

Research questions and friction points this paper is trying to address.

Parallel Drafting
Speculative Decoding
Long-context Training
Training Complexity
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parallel Drafting
Speculative Decoding
Long-context Training
Gradient Accumulation
Multi-token Prediction