Slow-Fast Inference: Training-Free Inference Acceleration via Within-Sentence Support Stability

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the high computational cost of long-context autoregressive decoding, which stems from repeatedly processing an ever-growing history at every step. The authors propose a training-free decoding framework that decouples generation into frequent, low-overhead fast steps and occasional, compute-intensive slow steps that use full attention. Stability rests on their observation that the dominant attention support tends to remain largely unchanged within a sentence, or more generally within a short semantically coherent span. By combining sparse memory reuse, semantic boundary detection, and dynamic switching between fast and slow steps, the method applies directly to existing model checkpoints without fine-tuning. Experiments demonstrate throughput improvements of 1.6–14.4× across various context lengths while maintaining generation quality on par with the full-KV attention baseline.

📝 Abstract
Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$--$14.4\times$ higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.
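The fast/slow alternation described in the abstract can be sketched as a toy decoding loop. This is a minimal illustration, not the paper's implementation: `select_support` is a hypothetical stand-in for the paper's Selector (which refreshes the sparse memory from dense attention scores), and semantic boundary detection is reduced here to a simple punctuation check.

```python
def select_support(history, k=4):
    """Slow-step stand-in for the Selector: scan the full history and
    keep the positions of the k most recent tokens as the sparse memory.
    (The paper selects support from dense attention scores instead.)"""
    return list(range(max(0, len(history) - k), len(history)))

def decode(tokens, is_boundary, k=4):
    """Alternate fast steps (reuse the cached sparse memory) with slow
    steps (dense pass that refreshes the memory), triggering a slow step
    at the start and after every detected semantic boundary."""
    history, memory, slow_steps = [], [], 0
    refresh = True  # force a slow step at the very first token
    for tok in tokens:
        if refresh:
            # Slow step: revisit the broader context, refresh the memory.
            memory = select_support(history, k)
            slow_steps += 1
            refresh = False
        # Fast step: decode attending only to the sparse memory (simulated
        # here by simply appending the token).
        history.append(tok)
        if is_boundary(tok):
            refresh = True  # next step will be a slow step
    return history, slow_steps
```

With a period as the boundary signal, decoding `a b c . d e . f` triggers three slow steps (one initial, one after each period) and five cheap fast steps, which is the source of the claimed throughput gain: most steps never touch the full history.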
Problem

Research questions and friction points this paper is trying to address.

long-context decoding
autoregressive inference
inference acceleration
attention stability
decoding efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Slow-Fast Inference
training-free acceleration
attention sparsity
long-context decoding
memory reuse