🤖 AI Summary
To address the slow inference and high deployment cost of large language models (LLMs), this paper proposes a speculative decoding (SD) architecture that introduces a Mixture of Attentions (MoA) into the small draft model. The design targets two limitations the authors identify in existing SD methods: draft models are not trained on-policy, and they operate under only partial observability of the target LLM's state. The framework supports both conventional single-device acceleration and a client-server deployment in which the draft model runs on a consumer device while the LLM is hosted on a server, enabling server-side verification, rollback of rejected drafts, and generation that continues even after a complete disconnection. On a single device, the method achieves a 9.5% speedup over EAGLE-2 and improves acceptance length by 25%. In the client-server setting, it attains state-of-the-art latency with minimal server calls across network conditions, and it maintains higher accuracy than other SD methods and API-based baselines when the connection is lost.
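For readers new to speculative decoding, the sketch below illustrates the generic draft-then-verify loop that methods like this one build on. It is a minimal, self-contained illustration, not the paper's method: `draft_model` and `target_model` are hypothetical random stand-ins for the MoA draft model and the target LLM, and verification is shown as simple greedy prefix matching rather than the paper's actual acceptance rule.

```python
import random

random.seed(0)
VOCAB = list(range(100))  # toy vocabulary of integer token ids


def draft_model(ctx, k=4):
    # Hypothetical stand-in: cheaply propose k future tokens.
    return [random.choice(VOCAB) for _ in range(k)]


def target_model(ctx, proposal):
    # Hypothetical stand-in: the LLM scores all k proposed tokens in one
    # parallel forward pass and returns its own choice at each position,
    # plus one extra token for the first rejected position.
    return [random.choice(VOCAB) for _ in range(len(proposal) + 1)]


def speculative_decode(prompt, max_tokens=32, k=4):
    out = list(prompt)
    while len(out) < max_tokens:
        proposal = draft_model(out, k)
        verified = target_model(out, proposal)
        # Accept the longest prefix where draft and target agree...
        n = 0
        while n < k and proposal[n] == verified[n]:
            n += 1
        out.extend(proposal[:n])
        # ...then roll back the rest and take one corrected target token,
        # so every round makes progress even when no draft is accepted.
        out.append(verified[n])
    return out[:max_tokens]


print(speculative_decode([1, 2, 3]))
```

The speedup comes from the target model checking all k drafted tokens in a single parallel forward pass, so each round advances by the accepted prefix plus one corrected token instead of one token per LLM call.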
📝 Abstract
The growth in the number of parameters of Large Language Models (LLMs) has led to a significant surge in computational requirements, making them challenging and costly to deploy. Speculative decoding (SD) leverages smaller models to efficiently propose future tokens, which are then verified by the LLM in parallel. Small models that utilise activations from the LLM currently achieve the fastest decoding speeds. However, we identify several limitations of SD models, including the lack of on-policyness during training and partial observability. To address these shortcomings, we propose a more grounded architecture for small models by introducing a Mixture of Attentions for SD. Our novel architecture can be applied in two scenarios: a conventional single-device deployment and a novel client-server deployment where the small model is hosted on a consumer device and the LLM on a server. In a single-device scenario, we demonstrate state-of-the-art speedups, improving EAGLE-2 by 9.5% and its acceptance length by 25%. In a client-server setting, our experiments demonstrate: 1) state-of-the-art latencies with minimal calls to the server across different network conditions, and 2) in the event of a complete disconnection, our approach can maintain higher accuracy compared to other SD methods and demonstrates advantages over API calls to LLMs, which would otherwise be unable to continue the generation process.
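As a companion to the abstract's client-server scenario, here is a hedged sketch of the control flow it describes: the client keeps drafting locally, the server verifies batches of drafts, and if the server becomes unreachable the client falls back to draft-only generation instead of stopping. All names (`remote_verify`, the random stand-in models) are hypothetical; the paper's actual protocol, acceptance rule, and MoA draft model are not reproduced here.

```python
import random

random.seed(0)
VOCAB = list(range(100))  # toy vocabulary of integer token ids


def draft_model(ctx, k=4):
    # Hypothetical stand-in for the client-side draft model.
    return [random.choice(VOCAB) for _ in range(k)]


def remote_verify(ctx, proposal):
    # Hypothetical server round-trip that verifies the drafted tokens;
    # a real implementation would be an RPC/HTTP call that can raise
    # ConnectionError when the network drops.
    return [random.choice(VOCAB) for _ in range(len(proposal) + 1)]


def generate(prompt, max_tokens=32, k=4):
    out = list(prompt)
    connected = True
    while len(out) < max_tokens:
        proposal = draft_model(out, k)
        if connected:
            try:
                verified = remote_verify(out, proposal)
            except ConnectionError:
                # Server lost: switch to draft-only decoding and keep going.
                connected = False
                out.extend(proposal)
                continue
            # Accept the longest agreed prefix, then one corrected token.
            n = 0
            while n < k and proposal[n] == verified[n]:
                n += 1
            out.extend(proposal[:n] + [verified[n]])
        else:
            # Degraded but uninterrupted generation from the draft model,
            # unlike an API-only client, which would simply stall.
            out.extend(proposal)
    return out[:max_tokens]


print(generate([1, 2, 3]))
```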