Unlocking the Address Book: Dissecting the Sparse Semantic Structure of LLM Key-Value Caches via Sparse Autoencoders

📅 2025-12-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
KV caching constitutes a critical memory bottleneck for large language models handling long contexts, yet its semantic structure has remained largely uninterpretable. This work introduces Top-K Sparse Transformer Autoencoders (STA-Attention), the first method to decouple KV caches into sparse routing keys and dense content-bearing values, exposing their fundamental asymmetry. We identify a “semantic elbow point” phenomenon—a principled criterion for selecting optimal sparsity—and propose a dual-budget sparsification strategy that preserves attention’s geometric structure while ensuring semantic fidelity. Evaluated on Yi-6B, Mistral-7B, and Qwen2.5-32B, STA-Attention achieves interpretable decomposition and efficient reconstruction of KV caches without degrading perplexity or zero-shot performance. The approach simultaneously ensures modeling fidelity and semantic transparency, advancing both efficiency and interpretability in long-context inference.

📝 Abstract
The Key-Value (KV) cache is the primary memory bottleneck in long-context Large Language Models, yet it is typically treated as an opaque numerical tensor. In this work, we propose **STA-Attention**, a framework that utilizes Top-K Sparse Autoencoders (SAEs) to decompose the KV cache into interpretable "semantic atoms." Unlike standard L1-regularized SAEs, our Top-K approach eliminates shrinkage bias, preserving the precise dot-product geometry required for attention. Our analysis uncovers a fundamental **Key-Value Asymmetry**: while Key vectors serve as highly sparse routers dominated by a "Semantic Elbow," deep Value vectors carry dense content payloads requiring a larger budget. Based on this structure, we introduce a Dual-Budget Strategy that selectively preserves the most informative semantic components while filtering representational noise. Experiments on Yi-6B, Mistral-7B, Qwen2.5-32B, and others show that our semantic reconstructions maintain perplexity and zero-shot performance comparable to the original models, effectively bridging the gap between mechanistic interpretability and faithful attention modeling.
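The core mechanism the abstract describes, a Top-K SAE that keeps only the k largest latent pre-activations and zeroes the rest (avoiding L1 shrinkage), can be sketched as follows. This is a minimal illustration, not the authors' implementation; all dimensions, weight initializations, and the function name `topk_sae_forward` are hypothetical.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Top-K SAE forward pass: keep only the k largest pre-activations
    (hard selection, no L1 penalty, hence no shrinkage bias)."""
    pre = x @ W_enc + b_enc                 # latent pre-activations, (n_latents,)
    idx = np.argpartition(pre, -k)[-k:]     # indices of the k largest entries
    z = np.zeros_like(pre)
    z[idx] = np.maximum(pre[idx], 0.0)      # ReLU on the surviving entries
    x_hat = z @ W_dec + b_dec               # reconstruct the original vector
    return z, x_hat

# Hypothetical sizes: 64-dim key/value vectors, 512 semantic atoms, budget k=8.
rng = np.random.default_rng(0)
d, n_latents, k = 64, 512, 8
W_enc = rng.normal(size=(d, n_latents)) / np.sqrt(d)
W_dec = rng.normal(size=(n_latents, d)) / np.sqrt(n_latents)
x = rng.normal(size=d)
z, x_hat = topk_sae_forward(x, W_enc, np.zeros(n_latents), W_dec, np.zeros(d), k)
```

Because sparsity is enforced by selection rather than a penalty, the surviving coefficients keep their full magnitude, which is what preserves the dot-product geometry the abstract emphasizes.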
Problem

Research questions and friction points this paper is trying to address.

Decompose KV cache into interpretable semantic atoms
Address Key-Value asymmetry in sparse semantic structure
Selectively preserve informative components while filtering noise
Innovation

Methods, ideas, or system contributions that make the work stand out.

Top-K Sparse Autoencoders decompose KV cache into semantic atoms
Dual-Budget Strategy selectively preserves informative semantic components
Framework maintains model performance while enabling interpretable attention modeling
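The Dual-Budget Strategy in the bullets above, a small sparsity budget for routing-like Key vectors and a larger one for content-bearing Value vectors, can be illustrated with a simple magnitude-based top-k mask. This is a schematic sketch of the idea only; the budget values and helper names (`topk_mask`, `dual_budget_sparsify`) are assumptions, not the paper's code.

```python
import numpy as np

def topk_mask(v, k):
    """Zero all but the k largest-magnitude entries of v."""
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out = np.zeros_like(v)
    out[idx] = v[idx]
    return out

def dual_budget_sparsify(key, value, k_key, k_value):
    """Apply asymmetric budgets: sparse Keys act as routers (small k),
    dense Values carry content payloads (larger k)."""
    return topk_mask(key, k_key), topk_mask(value, k_value)

# Hypothetical 64-dim head vectors with illustrative budgets 4 and 16.
rng = np.random.default_rng(1)
key, value = rng.normal(size=64), rng.normal(size=64)
k_sp, v_sp = dual_budget_sparsify(key, value, k_key=4, k_value=16)
```

The asymmetry in budgets is the point: forcing Values to the same small budget as Keys would discard content, while giving Keys a large budget would retain representational noise.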
Qingsen Ma
Beijing University of Posts and Telecommunications, Beijing, China
Dianyun Wang
Beijing University of Posts and Telecommunications, Beijing, China
Jiaming Lyu
Beijing University of Posts and Telecommunications, Beijing, China
Yaoye Wang
Beijing University of Posts and Telecommunications, Beijing, China
Lechen Ning
Beijing University of Posts and Telecommunications, Beijing, China
Sujie Zhu
Beijing University of Posts and Telecommunications, Beijing, China
Zhenbo Xu
Beijing University of Posts and Telecommunications, Beijing, China
Liuyu Xiang
Beijing University of Posts and Telecommunications, Beijing, China
Huining Li
Baidu Inc., Beijing, China
Huijia Wu
Beijing University of Posts and Telecommunications, Beijing, China
Zhaofeng He
Beijing University of Posts and Telecommunications, Beijing, China