TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

📅 2026-04-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the severe memory bottleneck of KV caching in long-context inference, where existing compression methods suffer from inadequate query representation due to Rotary Position Embedding (RoPE), leading to inaccurate key selection and unstable generation. The study makes the novel observation that pre-RoPE query and key vectors exhibit strong non-zero-centered clustering and establishes their trigonometric relationship with positional distance preferences. Building on this insight, the authors propose a position- and norm-aware key importance scoring mechanism that enables stable and efficient KV compression in the pre-RoPE space, circumventing the limitations of post-RoPE attention-based approaches. Evaluated on the AIME25 32K-generation task, the method matches Full Attention accuracy while achieving 2.5× higher throughput or 10.7× less KV memory usage, substantially outperforming baselines and enabling deployment of long-context models on a single consumer-grade GPU.
📝 Abstract
Extended reasoning in large language models (LLMs) creates severe KV cache memory bottlenecks. Leading KV cache compression methods estimate KV importance using attention scores from recent post-RoPE queries. However, queries rotate with position during RoPE, leaving very few representative queries, which leads to poor top-key selection and unstable reasoning. To avoid this issue, we turn to the pre-RoPE space, where we observe that Q and K vectors are highly concentrated around fixed non-zero centers and remain stable across positions -- Q/K concentration. We show that this concentration causes queries to preferentially attend to keys at specific distances (e.g., nearest keys), with the centers determining which distances are preferred via a trigonometric series. Based on this, we propose TriAttention, which estimates key importance by leveraging these centers: via the trigonometric series, the distance preference characterized by the centers scores keys according to their positions, with Q/K norms serving as an additional importance signal. On AIME25 with 32K-token generation, TriAttention matches Full Attention reasoning accuracy while achieving 2.5x higher throughput or 10.7x KV memory reduction, whereas leading baselines achieve only about half the accuracy at the same efficiency. TriAttention enables OpenClaw deployment on a single consumer GPU, where long context would otherwise cause out-of-memory with Full Attention.
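The abstract's core idea can be sketched in code. Under RoPE, the expected pre-softmax score between a query at position $m$ and a key at position $n$, when both are replaced by fixed pre-RoPE cluster centers, reduces to a trigonometric series in the distance $m-n$ whose coefficients come from the centers. The sketch below (a minimal illustration, not the paper's implementation; the combination rule with Q/K norms, the `alpha` weight, and the center-estimation scheme are assumptions) scores each cached key by this distance preference plus its norm:

```python
import numpy as np

def trig_distance_preference(mu_q, mu_k, max_dist, base=10000.0):
    """Expected pre-softmax score between the pre-RoPE Q and K cluster
    centers as a function of positional distance, via the trigonometric
    series induced by RoPE's per-pair rotations. mu_q, mu_k: (d,)."""
    d = mu_q.shape[0]
    theta = base ** (-np.arange(d // 2) / (d // 2))  # per-pair frequencies
    # even/odd components of each rotated 2-D pair
    q0, q1 = mu_q[0::2], mu_q[1::2]
    k0, k1 = mu_k[0::2], mu_k[1::2]
    a = q0 * k0 + q1 * k1            # cosine coefficients
    b = q0 * k1 - q1 * k0            # sine coefficients
    dist = np.arange(max_dist + 1)[:, None]    # (max_dist+1, 1)
    angles = dist * theta[None, :]             # (max_dist+1, d/2)
    return (a * np.cos(angles) + b * np.sin(angles)).sum(axis=1)

def score_keys(K_pre_rope, mu_q, query_pos, alpha=1.0):
    """Position- and norm-aware importance for each cached key.
    K_pre_rope: (n, d) pre-RoPE keys at positions 0..n-1; keys with the
    highest scores would be retained under compression."""
    n, _ = K_pre_rope.shape
    mu_k = K_pre_rope.mean(axis=0)             # estimated K center
    pref = trig_distance_preference(mu_q, mu_k, query_pos)
    dist = query_pos - np.arange(n)            # distance of each key
    norms = np.linalg.norm(K_pre_rope, axis=1) # norm signal (assumed form)
    return pref[dist] + alpha * norms
```

A compressor would then keep the top-k keys by this score, e.g. `keep = np.argsort(score_keys(K, mu_q, pos))[-k:]`, avoiding any dependence on post-RoPE attention from recent queries.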
Problem

Research questions and friction points this paper is trying to address.

KV cache compression
long-context reasoning
RoPE
attention mechanism
memory bottleneck
Innovation

Methods, ideas, or system contributions that make the work stand out.

TriAttention
KV cache compression
Q/K concentration
trigonometric series
long-context reasoning