🤖 AI Summary
Autoregressive inference in large language models incurs substantial verification overhead in long-sequence scenarios. To address this, this work proposes TriSpec, a ternary speculative decoding framework that optimizes speculative decoding by explicitly reducing verification costs. TriSpec introduces a lightweight proxy module that directly accepts high-confidence draft tokens without invoking the target model, rejects low-confidence tokens outright, and consults the target model only when uncertainty is detected—thereby enabling a three-way decision mechanism. This approach significantly reduces the number of target model invocations while preserving generation quality. Experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA model families demonstrate that TriSpec achieves up to 35% faster inference and reduces target model calls by as much as 50% compared to standard speculative decoding.
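The three-way decision described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold values, the `ternary_verify` function name, and the idea of scoring each draft token with a single proxy confidence value are all assumptions made for clarity.

```python
# Hypothetical sketch of TriSpec-style ternary verification.
# Thresholds and interfaces are illustrative assumptions, not the paper's code.

def ternary_verify(proxy_confidence, accept_thr=0.9, reject_thr=0.3):
    """Classify each draft token as 'accept', 'reject', or 'uncertain'.

    proxy_confidence[i] is the lightweight proxy's confidence in draft token i.
    Only 'uncertain' tokens require invoking the full target model, which is
    where the reduction in verification cost comes from.
    """
    decisions = []
    for conf in proxy_confidence:
        if conf >= accept_thr:
            decisions.append("accept")      # proxy approves; target model skipped
        elif conf <= reject_thr:
            decisions.append("reject")      # proxy rejects; token is resampled
        else:
            decisions.append("uncertain")   # defer to the full target model
    return decisions
```

Under this scheme, the target model's cost scales with the fraction of uncertain tokens rather than with the full draft length, which is consistent with the reported reduction in target model calls.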
📝 Abstract
Inference efficiency in Large Language Models (LLMs) is fundamentally limited by their serial, autoregressive generation, especially as reasoning becomes a key capability and response sequences grow longer. Speculative decoding (SD) offers a powerful solution, providing significant speed-ups through its lightweight drafting and parallel verification mechanism. While existing work has nearly saturated improvements in draft effectiveness and efficiency, this paper advances SD from a new yet critical perspective: the verification cost. We propose TriSpec, a novel ternary SD framework that, at its core, introduces a lightweight proxy to significantly reduce computational cost by approving easily verifiable draft sequences and engaging the full target model only when encountering uncertain tokens. TriSpec can be integrated with state-of-the-art SD methods like EAGLE-3 to further reduce verification costs, achieving greater acceleration. Extensive experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35% speedup over standard SD, with up to 50% fewer target model invocations while maintaining comparable accuracy.