Cross-Attention Speculative Decoding

πŸ“… 2025-05-30
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing speculative decoding methods predominantly rely on tightly coupled self-attention Transformer decoders, resulting in architectural complexity and poor generalizability. This paper proposes Beagle, the first lightweight speculative decoding framework based exclusively on cross-attention mechanisms, fully decoupling the backbone language model from auxiliary modules. Our key contributions are: (1) introducing a novel cross-attention architecture that replaces conventional self-attention, eliminating the need for auxiliary decoders; and (2) proposing a two-stage block-level attention training strategy to jointly optimize training stability and convergence efficiency. Experiments demonstrate that Beagle achieves inference throughput comparable to EAGLE-v2, with higher training efficiency, constant memory footprint, and seamless transferability across diverse LLMs and datasets. These advances significantly enhance the generality, practicality, and deployment flexibility of speculative decoding.

πŸ“ Abstract
Speculative decoding (SD) is a widely adopted approach for accelerating inference in large language models (LLMs), particularly when the draft and target models are well aligned. However, state-of-the-art SD methods typically rely on tightly coupled, self-attention-based Transformer decoders, often augmented with auxiliary pooling or fusion layers. This coupling makes them increasingly complex and harder to generalize across different models. We present Budget EAGLE (Beagle), the first, to our knowledge, cross-attention-based Transformer decoder SD model that achieves performance on par with leading self-attention SD models (EAGLE-v2) while eliminating the need for pooling or auxiliary components, simplifying the architecture, improving training efficiency, and maintaining stable memory usage during training-time simulation. To enable effective training of this novel architecture, we propose Two-Stage Block-Attention Training, a new method that achieves training stability and convergence efficiency in block-level attention scenarios. Extensive experiments across multiple LLMs and datasets show that Beagle achieves competitive inference speedups and higher training efficiency than EAGLE-v2, offering a strong alternative for architectures in speculative decoding.
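The draft-then-verify loop that speculative decoding refers to can be sketched generically. The toy below illustrates greedy speculative decoding in general, not Beagle's specific architecture; `target_next` and `draft_next` are hypothetical callables standing in for full model forward passes:

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=16):
    """Greedy speculative decoding sketch (generic, not Beagle-specific).

    target_next / draft_next: callables mapping a token sequence to the
    next token, stand-ins for the target and draft model forward passes.
    """
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. The cheap draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(seq + draft))
        # 2. The target model verifies the k positions (a real LLM would
        #    score them all in a single batched forward pass).
        accepted = 0
        for i in range(k):
            if target_next(seq + draft[:i]) == draft[i]:
                accepted += 1
            else:
                break
        seq += draft[:accepted]
        # 3. Whether the drafts were rejected or fully accepted, the target
        #    emits one token, so every iteration makes progress.
        seq.append(target_next(seq))
    return seq
```

When draft and target agree often, most iterations accept several tokens per (expensive) target pass, which is the source of the speedup the abstract describes.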
Problem

Research questions and friction points this paper is trying to address.

Architectural complexity of tightly coupled, self-attention-based SD designs
Reliance on auxiliary pooling and fusion components in SD models
Training inefficiency and unstable memory usage during training-time simulation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-attention-based Transformer decoder SD model
Two-Stage Block-Attention Training method
Simplified architecture without auxiliary components
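The core architectural idea, draft positions attending to cached backbone hidden states via cross-attention instead of running their own self-attention stack, can be illustrated with a minimal single-head sketch. All names and shapes here are hypothetical; the paper's actual module will differ:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_draft(query_states, backbone_states, Wq, Wk, Wv):
    """Single-head cross-attention sketch: draft positions (queries)
    attend only to frozen backbone hidden states (keys/values), so the
    drafter keeps no self-attention cache of its own.
    """
    Q = query_states @ Wq       # (n_draft, d): draft-position queries
    K = backbone_states @ Wk    # (n_ctx, d):   keys from backbone states
    V = backbone_states @ Wv    # (n_ctx, d):   values from backbone states
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V   # (n_draft, d)
```

Because the keys and values come entirely from the backbone's already-computed states, the drafter's memory footprint does not grow with its own generated prefix, which is consistent with the constant-memory behavior claimed above.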
πŸ”Ž Similar Papers
No similar papers found.
Wei Zhong
Department of Statistics, Xiamen University
Statistics
Manasa Bharadwaj
Staff AI Research Scientist, LG Electronics Toronto AI Lab
NLP, Conversational AI, Generative AI
Yixiao Wang
LG Electronics, Toronto AI Lab
Nikhil Verma
LG Electronics, Toronto AI Lab
Yipeng Ji
LG Electronics, Toronto AI Lab
Chul Lee
LG Electronics, Toronto AI Lab