🤖 AI Summary
This work addresses the high inference latency of large language models caused by sequential autoregressive decoding, and the limited adoption of speculative decoding stemming from a lack of high-quality draft models and scalable training infrastructure. The authors propose a target-draft decoupled training strategy, combined with hybrid parallelism and customized training kernels, to build the first efficient open-source training framework supporting EAGLE-3. They also release SpecBundle, a suite of high-quality draft models covering mainstream large language models. The framework enables, for the first time, efficient EAGLE-3 training on ultra-large-scale models such as Qwen3-235B-A22B, delivering a 9.9x training speedup, while the released draft models achieve up to 4.48x end-to-end inference acceleration. Deep integration with production-grade inference engines such as SGLang further advances the practical deployment of speculative decoding.
📝 Abstract
Large language models incur high inference latency due to sequential autoregressive decoding. Speculative decoding alleviates this bottleneck by using a lightweight draft model to propose multiple tokens for batched verification. However, its adoption has been limited by the lack of high-quality draft models and scalable training infrastructure. We introduce SpecForge, an open-source, production-oriented framework for training speculative decoding models with full support for EAGLE-3. SpecForge incorporates target-draft decoupling, hybrid parallelism, optimized training kernels, and integration with production-grade inference engines, enabling up to 9.9x faster EAGLE-3 training for Qwen3-235B-A22B. In addition, we release SpecBundle, a suite of production-grade EAGLE-3 draft models trained with SpecForge for mainstream open-source LLMs. Through a systematic study of speculative decoding training recipes, SpecBundle addresses the scarcity of high-quality drafts in the community, and our draft models achieve up to 4.48x end-to-end inference speedup on SGLang, establishing SpecForge as a practical foundation for real-world speculative decoding deployment.
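To make the draft-then-verify mechanism in the abstract concrete, here is a minimal greedy speculative-decoding sketch. The toy `target_next`/`draft_next` functions and the vocabulary are illustrative stand-ins of my own, not SpecForge or EAGLE-3 APIs; in practice the draft is a small learned model and verification happens in one batched forward pass of the target.

```python
import random

# Toy stand-ins for real models: each maps a context (tuple of ints)
# to a single greedy next token over a tiny vocabulary. In a real
# deployment the draft would be a lightweight model (e.g. an EAGLE-3
# head) and the target a full LLM.
VOCAB = list(range(8))

def target_next(ctx):
    # Deterministic toy "target model": next token = (sum of ctx) % 8.
    return sum(ctx) % 8

def draft_next(ctx):
    # Toy draft that agrees with the target most of the time.
    return target_next(ctx) if random.random() < 0.8 else random.choice(VOCAB)

def speculative_step(ctx, k=4):
    """One greedy speculative-decoding step: the draft proposes k tokens,
    the target checks them (batched in practice, sequential here for
    clarity), and we keep the longest agreeing prefix plus one token
    from the target (a correction on mismatch, a bonus token otherwise)."""
    proposal, d_ctx = [], list(ctx)
    for _ in range(k):
        t = draft_next(tuple(d_ctx))
        proposal.append(t)
        d_ctx.append(t)
    accepted, v_ctx = [], list(ctx)
    for t in proposal:
        t_star = target_next(tuple(v_ctx))
        if t != t_star:
            accepted.append(t_star)  # target's correction ends the step
            return accepted
        accepted.append(t)
        v_ctx.append(t)
    accepted.append(target_next(tuple(v_ctx)))  # all k accepted: bonus token
    return accepted
```

With greedy verification, the output is token-for-token identical to decoding with the target alone; the speedup comes from accepting several draft tokens per target pass, which is why acceptance length (and hence draft quality) drives the end-to-end gains reported above.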