🤖 AI Summary
Autoregressive inference in large language models incurs substantial verification overhead in long-sequence scenarios. To address this, this work proposes TriSpec, a ternary speculative decoding framework that optimizes speculative decoding by explicitly reducing verification costs. TriSpec introduces a lightweight proxy module that directly accepts high-confidence draft tokens without invoking the target model, rejects low-confidence tokens outright, and consults the target model only when uncertainty is detected—thereby enabling a three-way decision mechanism. This approach significantly reduces the number of target model invocations while preserving generation quality. Experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA model families demonstrate that TriSpec achieves up to 35% faster inference and reduces target model calls by as much as 50% compared to standard speculative decoding.
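The three-way decision described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold values, the `ternary_verify` function name, and the idea of scoring each draft token with a single proxy confidence value are all assumptions made for clarity.

```python
# Hypothetical sketch of TriSpec-style ternary verification.
# Thresholds and interfaces are illustrative assumptions, not the paper's code.

def ternary_verify(proxy_confidence, accept_thr=0.9, reject_thr=0.3):
    """Classify each draft token as 'accept', 'reject', or 'uncertain'.

    proxy_confidence[i] is the lightweight proxy's confidence in draft token i.
    Only 'uncertain' tokens require invoking the full target model, which is
    where the reduction in verification cost comes from.
    """
    decisions = []
    for conf in proxy_confidence:
        if conf >= accept_thr:
            decisions.append("accept")      # proxy approves; target model skipped
        elif conf <= reject_thr:
            decisions.append("reject")      # proxy rejects; token is resampled
        else:
            decisions.append("uncertain")   # defer to the full target model
    return decisions
```

Under this scheme, the target model's cost scales with the fraction of uncertain tokens rather than with the full draft length, which is consistent with the reported reduction in target model calls.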
📝 Abstract
Inference efficiency in Large Language Models (LLMs) is fundamentally limited by their serial, autoregressive generation, especially as reasoning becomes a key capability and response sequences grow longer. Speculative decoding (SD) offers a powerful solution, providing significant speed-ups through its lightweight drafting and parallel verification mechanism. While existing work has nearly saturated improvements in draft effectiveness and efficiency, this paper advances SD from a new yet critical perspective: the verification cost. We propose TriSpec, a novel ternary SD framework that, at its core, introduces a lightweight proxy to significantly reduce computational cost by approving easily verifiable draft sequences and engaging the full target model only when encountering uncertain tokens. TriSpec can be integrated with state-of-the-art SD methods like EAGLE-3 to further reduce verification costs, achieving greater acceleration. Extensive experiments on the Qwen3 and DeepSeek-R1-Distill-Qwen/LLaMA families show that TriSpec achieves up to 35% speedup over standard SD, with up to 50% fewer target model invocations while maintaining comparable accuracy.