🤖 AI Summary
This work addresses the high computational cost of long-context autoregressive decoding, which stems from repeatedly processing an ever-growing history at each step. The authors propose a training-free decoding framework that decouples generation into frequent, low-cost fast steps and occasional, compute-intensive slow steps that use dense attention. The design rests on the observation that, within a sentence or a short semantically coherent span, the dominant attention support stays largely stable. The method combines sparse memory reuse, slow steps triggered near semantic boundaries, and dynamic switching between fast and slow steps, and it applies directly to existing model checkpoints without fine-tuning. Experiments report approximately 1.6–14.4× higher decoding throughput across the evaluated context lengths while generally maintaining generation quality on par with the full-KV baseline.
📝 Abstract
Long-context autoregressive decoding remains expensive because each decoding step must repeatedly process a growing history. We observe a consistent pattern during decoding: within a sentence, and more generally within a short semantically coherent span, the dominant attention support often remains largely stable. Motivated by this observation, we propose Slow-Fast Inference (SFI), a training-free decoding framework that decouples generation into frequent low-cost fast steps and occasional dense-attention slow steps. Fast steps reuse a compact sparse memory for efficient decoding. Slow steps are triggered near semantic boundaries. At slow steps, the model revisits the broader context and uses the Selector to refresh the selected memory for subsequent fast steps. Across the evaluated context lengths, SFI delivers approximately $1.6\times$--$14.4\times$ higher decoding throughput while generally maintaining quality on par with the full-KV baseline across long-context and long-CoT settings. Because SFI is training-free and applies directly to existing checkpoints, it offers a practical path to reducing inference cost for contemporary autoregressive reasoning models in long-context, long-horizon, and agentic workloads.
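The slow/fast alternation described in the abstract can be illustrated with a toy scheduler. This is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify how boundaries are detected or how the Selector ranks positions, so the sketch treats sentence-ending punctuation as the boundary signal, uses top-k dense-attention weight as a stand-in Selector, and adds a small recent-token window to fast steps. All names (`BOUNDARY_TOKENS`, `select_top_k`, `slow_fast_schedule`, `k`, `recent`) are hypothetical.

```python
import math

# Assumption: sentence-ending tokens stand in for "semantic boundaries".
BOUNDARY_TOKENS = {".", "!", "?"}


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k(query, keys, history, k):
    """Hypothetical Selector: keep the k history positions with the
    largest dense-attention weight for this query."""
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, keys[p])) / scale for p in history]
    weights = softmax(scores)
    ranked = sorted(zip(weights, history), reverse=True)
    return sorted(p for _, p in ranked[:k])


def slow_fast_schedule(tokens, queries, keys, k=4, recent=2):
    """Replay a token stream and record, per step, whether it is a slow
    (dense) or fast (sparse) step and which positions it attends to."""
    trace = []
    memory = []        # sparse positions chosen at the last slow step
    need_slow = True   # assumption: the first step is dense, to initialize memory
    for t, token in enumerate(tokens):
        history = list(range(t + 1))  # causal: positions 0..t
        if need_slow:
            # Slow step: attend to the full history and refresh the memory.
            attended = history
            memory = select_top_k(queries[t], keys, history, k)
            kind = "slow"
        else:
            # Fast step: reuse the selected memory plus a short recent window
            # (the window is an assumption of this sketch).
            window = range(max(0, t + 1 - recent), t + 1)
            attended = sorted(set(memory) | set(window))
            kind = "fast"
        trace.append((kind, attended))
        # Assumption: a boundary token schedules a slow step for the next position.
        need_slow = token in BOUNDARY_TOKENS
    return trace


tokens = ["The", "cat", "sat", ".", "It", "ran", "."]
vectors = [[math.sin(i), math.cos(i)] for i in range(len(tokens))]
trace = slow_fast_schedule(tokens, queries=vectors, keys=vectors, k=2, recent=2)
print([kind for kind, _ in trace])
# → ['slow', 'fast', 'fast', 'fast', 'slow', 'fast', 'fast']
```

In this sketch each fast step touches at most `k + recent` positions regardless of history length, whereas a slow step touches the whole history. The reported speedup depends on how rarely slow steps are needed, which in turn depends on how stable the attention support is within a span.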