SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models

📅 2026-04-18

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the memory-bandwidth bottleneck that KV-cache loading imposes on long-context decoding, where existing acceleration methods often trade accuracy for speed and lack a mechanistic understanding of the “attention sink” phenomenon. The authors propose SinkRouter, a training-free routing framework built on the observation that attention sinks correspond to stable, reachable, and error-controllable fixed points formed during training. Building on this insight, SinkRouter detects the sink signal and selectively skips computations that would otherwise produce near-zero output. By combining training-free sink detection, block-level branching, Split-K parallelism, and a custom Triton GPU kernel, SinkRouter reaches a 2.03× decoding speedup at a 512K context while maintaining competitive accuracy across multiple models and benchmarks.

📝 Abstract

In long-context decoding for LLMs and LMMs, attention becomes increasingly memory-bound because each decoding step must load a large amount of KV-cache data from GPU memory. Existing acceleration strategies often trade efficiency for accuracy by relying on heuristic pruning that may discard useful information. At a deeper level, they also tend to indiscriminately preserve all high-scoring tokens, treat early tokens as indispensable anchors, or rely on heuristic head routing, reflecting an insufficient mechanistic understanding of the attention sink phenomenon. In this paper, we show that the attention sink phenomenon corresponds to a stable, reachable, and error-controllable fixed point constructed during training. Based on this insight, we propose SinkRouter, a training-free selective routing framework that detects the sink signal and skips computations that would otherwise produce near-zero output. To translate this mechanism into real-world acceleration, we develop a hardware-aware Triton kernel with block-level branching and Split-K parallelism. We conduct extensive evaluations on a diverse suite of long-context benchmarks, including LongBench, InfiniteBench, CVBench, MileBench, and MMVP, using both text-only and multimodal backbones such as Llama-3.1-8B, Llama-3.1-70B, Yi-9B-200K, LLaVA-1.5-7B, and LLaVA-1.5-13B. Across these settings, SinkRouter consistently improves decoding efficiency while maintaining competitive accuracy, and reaches 2.03x speedup with a 512K context.
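
The abstract names two kernel-level ideas, block-level branching and Split-K parallelism, without giving details. The following is a minimal pure-Python sketch of how such a decode step could look for a single head. It is not the paper's algorithm or its Triton kernel. The skip rule (a Cauchy-Schwarz bound compared against the sink logit), the choice of token 0 as the sink, and the names `decode_step`, `partial_attention`, `merge`, `block`, and `eps` are all assumptions made for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def partial_attention(q, keys, values):
    """Attention over one KV block: (max_logit, sum_of_exp, weighted value sum)."""
    logits = [dot(q, k) for k in keys]
    m = max(logits)
    weights = [math.exp(s - m) for s in logits]
    acc = [0.0] * len(values[0])
    for w, v in zip(weights, values):
        for i, x in enumerate(v):
            acc[i] += w * x
    return m, sum(weights), acc

def merge(parts):
    """Split-K style reduction: rescale each partial to a shared max, then normalize."""
    m = max(p[0] for p in parts)
    denom = 0.0
    out = [0.0] * len(parts[0][2])
    for pm, ps, pacc in parts:
        scale = math.exp(pm - m)
        denom += scale * ps
        for i, x in enumerate(pacc):
            out[i] += scale * x
    return [x / denom for x in out]

def decode_step(q, keys, values, block=4, eps=1e-6, skip=True):
    """One decode step for one head. ASSUMPTION: token 0 is the sink token."""
    sink_logit = dot(q, keys[0])
    q_norm = math.sqrt(dot(q, q))
    parts = [partial_attention(q, keys[:1], values[:1])]  # the sink is always kept
    skipped = 0
    for start in range(1, len(keys), block):
        kb = keys[start:start + block]
        vb = values[start:start + block]
        # Cauchy-Schwarz upper bound on every logit in this block. A real kernel
        # would precompute the per-block max key norm instead of reading the keys.
        bound = q_norm * max(math.sqrt(dot(k, k)) for k in kb)
        # Hypothetical block-level branch: skip when the block's softmax mass
        # relative to the sink is provably below eps (compared in log space).
        if skip and math.log(len(kb)) + bound - sink_logit < math.log(eps):
            skipped += 1
            continue
        parts.append(partial_attention(q, kb, vb))
    return merge(parts), skipped
```

In this sketch each block yields an independent partial result, so the blocks could be processed in parallel and combined by `merge`, which is the role Split-K plays in a GPU kernel. When the query is dominated by the sink, whole blocks fall below the `eps` threshold and are never loaded; when it is not, no block is skipped and the result matches dense attention.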

Problem

Research questions and friction points this paper is trying to address.

long-context decoding

attention sink

KV-cache

memory-bound attention

efficient inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

attention sink

training-free routing

KV-cache optimization