🤖 AI Summary
KV caching constitutes a critical memory bottleneck for large language models handling long contexts, yet its semantic structure has remained largely uninterpretable. This work introduces Top-K Sparse Transformer Autoencoders (STA-Attention), the first method to decouple KV caches into sparse routing keys and dense content-bearing values, exposing their fundamental asymmetry. We identify a “semantic elbow point” phenomenon—a principled criterion for selecting optimal sparsity—and propose a dual-budget sparsification strategy that preserves attention’s geometric structure while ensuring semantic fidelity. Evaluated on Yi-6B, Mistral-7B, and Qwen2.5-32B, STA-Attention achieves interpretable decomposition and efficient reconstruction of KV caches without degrading perplexity or zero-shot performance. The approach simultaneously ensures modeling fidelity and semantic transparency, advancing both efficiency and interpretability in long-context inference.
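The "semantic elbow point" mentioned above can be illustrated with a simple heuristic: sort a vector's latent activations by magnitude and cut the budget at the largest drop between consecutive values. This is an illustrative sketch only; the `semantic_elbow` function and the largest-drop rule are assumptions for exposition, not the paper's actual criterion.

```python
import numpy as np

def semantic_elbow(acts):
    """Pick a sparsity budget at the 'elbow' of the sorted activation
    magnitudes: keep every latent before the largest drop between
    consecutive values. (Illustrative heuristic, not the paper's rule.)"""
    s = np.sort(np.abs(np.asarray(acts, dtype=float)))[::-1]
    drops = s[:-1] - s[1:]          # size of each consecutive drop
    return int(np.argmax(drops)) + 1

# A few strong "router" latents followed by a long tail of noise:
acts = [10.0, 9.0, 8.0, 0.1, 0.05, 0.02]
k = semantic_elbow(acts)
print(k)  # → 3: the elbow separates 3 dominant components from the tail
```

Under this heuristic, Key vectors with a sharp elbow get a small budget, while denser Value vectors, whose sorted curve decays gradually, would naturally receive a larger one.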
📝 Abstract
The Key-Value (KV) cache is the primary memory bottleneck in long-context Large Language Models, yet it is typically treated as an opaque numerical tensor. In this work, we propose **STA-Attention**, a framework that uses Top-K Sparse Autoencoders (SAEs) to decompose the KV cache into interpretable "semantic atoms." Unlike standard $L_1$-regularized SAEs, our Top-K approach eliminates shrinkage bias, preserving the precise dot-product geometry required for attention. Our analysis uncovers a fundamental **Key-Value Asymmetry**: while Key vectors serve as highly sparse routers dominated by a "Semantic Elbow," deep Value vectors carry dense content payloads that require a larger budget. Based on this structure, we introduce a Dual-Budget Strategy that selectively preserves the most informative semantic components while filtering representational noise. Experiments on Yi-6B, Mistral-7B, Qwen2.5-32B, and others show that our semantic reconstructions maintain perplexity and zero-shot performance comparable to the original models, effectively bridging the gap between mechanistic interpretability and faithful attention modeling.
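As a rough sketch of the Top-K mechanism the abstract describes: encode a cached vector, hard-zero all but the k largest latent activations (no shrinkage, in contrast to an $L_1$ penalty), then decode. The parameterization and names below (`W_enc`, `W_dec`, biases) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def topk_sae_reconstruct(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x into a sparse code keeping only the top-k latents, then decode.

    Surviving activations keep their exact magnitudes (no L1 shrinkage),
    which is what preserves the dot-product geometry attention relies on.
    """
    pre = W_enc @ (x - b_dec) + b_enc           # latent pre-activations
    z = np.maximum(pre, 0.0)                    # ReLU
    if k < z.size:
        tail = np.argpartition(z, -k)[:-k]      # indices of all but the top-k
        z[tail] = 0.0                           # hard-zero, no shrinkage
    return W_dec @ z + b_dec, z

rng = np.random.default_rng(0)
d, m, k = 8, 32, 4                              # model dim, dict size, budget
x = rng.normal(size=d)                          # stand-in for a cached K/V vector
W_enc = rng.normal(size=(m, d)) / np.sqrt(d)
W_dec = rng.normal(size=(d, m)) / np.sqrt(m)
b_enc, b_dec = np.zeros(m), np.zeros(d)

x_hat, z = topk_sae_reconstruct(x, W_enc, b_enc, W_dec, b_dec, k)
print(int((z != 0).sum()))                      # at most k active semantic atoms
```

A dual-budget variant would simply call this with a small k for Key vectors and a larger k for Value vectors, matching the asymmetry described above.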