Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding

📅 2025-11-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large reasoning models face memory-bandwidth bottlenecks during long chain-of-thought (CoT) generation: the KV cache grows linearly with sequence length, causing attention computation to be memory-bound rather than compute-bound. This work introduces SparseSpec—the first system-level optimization framework integrating speculative decoding with PillarAttn’s sparse attention. Its key contributions are: (1) leveraging information reuse from the verification stage to precisely identify and retain critical tokens for sparse attention, and (2) co-designing unified scheduling, latency-aware verification, and dynamic KV-cache management to overlap computation with memory access. Evaluated across multiple large language models and CoT benchmarks, SparseSpec achieves up to a 2.13× throughput improvement over state-of-the-art methods, with significant gains in both latency and memory efficiency.
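To make the memory-bound regime concrete, here is a back-of-envelope sketch of how much KV cache a single decode step must read. All model dimensions below are hypothetical (chosen to resemble a 7B-class model) and are not taken from the paper:

```python
# Back-of-envelope KV-cache sizing for one decode step.
# Dimensions are hypothetical (7B-class model), not from the paper.

def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=32, head_dim=128, dtype_bytes=2):
    """Total bytes of K and V cached for `seq_len` tokens (factor 2 = K and V)."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

# Full attention must read the entire cache every step, so per-token
# memory traffic grows linearly with the chain-of-thought length.
for seq_len in (4_096, 32_768):
    gib = kv_cache_bytes(seq_len) / 2**30
    print(f"{seq_len:>6} tokens -> {gib:.1f} GiB read per decode step")
```

Under these assumptions a 32K-token CoT forces roughly 16 GiB of cache reads per generated token, which is why sparse attention over a small set of critical tokens pays off.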

📝 Abstract
Reasoning language models have demonstrated remarkable capabilities on challenging tasks by generating elaborate chain-of-thought (CoT) solutions. However, such lengthy generation shifts the inference bottleneck from compute-bound to memory-bound. To generate each token, the model applies full attention to all previously generated tokens, requiring memory access to an increasingly large KV-Cache. Consequently, longer generations demand more memory access for every step, leading to substantial pressure on memory bandwidth. To address this, we introduce SparseSpec, a speculative decoding framework that reuses the same model as the draft and target models (i.e., self-speculation). SparseSpec features a novel sparse attention mechanism, PillarAttn, as the draft model, which accurately selects critical tokens via elegantly reusing information from the verification stage. Furthermore, SparseSpec co-designs self-speculation with three system innovations: (1) a unified scheduler to batch token drafting and verification, (2) delayed verification for CPU/GPU overlap, and (3) dynamic KV-Cache management to maximize memory utilization. Across various models and datasets, SparseSpec outperforms state-of-the-art solutions, with up to a 2.13× throughput speedup.
Problem

Research questions and friction points this paper is trying to address.

How to relieve the memory-bandwidth bottleneck in large reasoning model inference
How to accelerate long-CoT generation by reusing the same model for speculative decoding
How to improve throughput through sparse attention and system-level co-design
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-speculative decoding framework reuses the same model as both draft and target
Sparse attention mechanism (PillarAttn) selects critical tokens by reusing information from the verification stage
Co-designs system innovations: a unified drafting/verification scheduler, delayed verification for CPU/GPU overlap, and dynamic KV-Cache management
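The draft-then-verify loop behind self-speculation can be sketched as follows. This is a minimal illustration with a hypothetical `model(tokens, sparse=...)` interface; the paper's PillarAttn token selection, batched verification, and scheduler are not reproduced here:

```python
# Minimal self-speculative decoding sketch (greedy acceptance).
# `model` is a hypothetical callable: model(tokens, sparse=...) -> next token.
# The same model serves as draft (sparse attention) and target (full attention).

def self_speculate(model, prefix, k=4):
    """Draft k tokens cheaply with sparse attention, then verify them
    with full attention and accept the longest matching prefix."""
    draft = list(prefix)
    for _ in range(k):                          # cheap drafting passes
        draft.append(model(draft, sparse=True))
    # In a real system this is one batched full-attention pass; here each
    # drafted position is scored separately for clarity.
    verified = [model(draft[:i], sparse=False)
                for i in range(len(prefix), len(draft) + 1)]
    out = list(prefix)
    for d, v in zip(draft[len(prefix):], verified):
        if d != v:                              # first mismatch: keep the
            out.append(v)                       # verifier's token and stop
            break
        out.append(d)                           # match: accept the draft token
    else:
        out.append(verified[-1])                # all accepted: free bonus token
    return out
```

Because draft and target share one set of weights and one KV cache, every accepted draft token saves a full-attention decode step, and the verification pass itself supplies the attention statistics that (in the paper's design) drive the next round of critical-token selection.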