🤖 AI Summary
KV caching constitutes a critical memory bottleneck for large language models handling long contexts, yet its semantic structure has remained largely uninterpretable. This work introduces Top-K Sparse Transformer Autoencoders (STA-Attention), the first method to decouple KV caches into sparse routing keys and dense content-bearing values, exposing their fundamental asymmetry. We identify a “semantic elbow point” phenomenon—a principled criterion for selecting optimal sparsity—and propose a dual-budget sparsification strategy that preserves attention’s geometric structure while ensuring semantic fidelity. Evaluated on Yi-6B, Mistral-7B, and Qwen2.5-32B, STA-Attention achieves interpretable decomposition and efficient reconstruction of KV caches without degrading perplexity or zero-shot performance. The approach simultaneously ensures modeling fidelity and semantic transparency, advancing both efficiency and interpretability in long-context inference.
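The "semantic elbow point" mentioned above can be illustrated with a simple heuristic: sort a vector's latent activations by magnitude and cut the budget at the largest drop between consecutive values. This is an illustrative sketch only; the `semantic_elbow` function and the largest-drop rule are assumptions for exposition, not the paper's actual criterion.

```python
import numpy as np

def semantic_elbow(acts):
    """Pick a sparsity budget at the 'elbow' of the sorted activation
    magnitudes: keep every latent before the largest drop between
    consecutive values. (Illustrative heuristic, not the paper's rule.)"""
    s = np.sort(np.abs(np.asarray(acts, dtype=float)))[::-1]
    drops = s[:-1] - s[1:]          # size of each consecutive drop
    return int(np.argmax(drops)) + 1

# A few strong "router" latents followed by a long tail of noise:
acts = [10.0, 9.0, 8.0, 0.1, 0.05, 0.02]
k = semantic_elbow(acts)
print(k)  # → 3: the elbow separates 3 dominant components from the tail
```

Under this heuristic, Key vectors with a sharp elbow get a small budget, while denser Value vectors, whose sorted curve decays gradually, would naturally receive a larger one.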
📝 Abstract
The Key-Value (KV) cache is the primary memory bottleneck in long-context Large Language Models, yet it is typically treated as an opaque numerical tensor. In this work, we propose **STA-Attention**, a framework that uses Top-K Sparse Autoencoders (SAEs) to decompose the KV cache into interpretable "semantic atoms." Unlike standard $L_1$-regularized SAEs, our Top-K approach eliminates shrinkage bias, preserving the precise dot-product geometry required for attention. Our analysis uncovers a fundamental **Key-Value Asymmetry**: while Key vectors serve as highly sparse routers dominated by a "Semantic Elbow," deep Value vectors carry dense content payloads that require a larger budget. Based on this structure, we introduce a Dual-Budget Strategy that selectively preserves the most informative semantic components while filtering representational noise. Experiments on Yi-6B, Mistral-7B, Qwen2.5-32B, and others show that our semantic reconstructions maintain perplexity and zero-shot performance comparable to the original models, effectively bridging the gap between mechanistic interpretability and faithful attention modeling.
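As a rough sketch of the Top-K mechanism the abstract describes: encode a cached vector, hard-zero all but the k largest latent activations (no shrinkage, in contrast to an $L_1$ penalty), then decode. The parameterization and names below (`W_enc`, `W_dec`, biases) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def topk_sae_reconstruct(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x into a sparse code keeping only the top-k latents, then decode.

    Surviving activations keep their exact magnitudes (no L1 shrinkage),
    which is what preserves the dot-product geometry attention relies on.
    """
    pre = W_enc @ (x - b_dec) + b_enc           # latent pre-activations
    z = np.maximum(pre, 0.0)                    # ReLU
    if k < z.size:
        tail = np.argpartition(z, -k)[:-k]      # indices of all but the top-k
        z[tail] = 0.0                           # hard-zero, no shrinkage
    return W_dec @ z + b_dec, z

rng = np.random.default_rng(0)
d, m, k = 8, 32, 4                              # model dim, dict size, budget
x = rng.normal(size=d)                          # stand-in for a cached K/V vector
W_enc = rng.normal(size=(m, d)) / np.sqrt(d)
W_dec = rng.normal(size=(d, m)) / np.sqrt(m)
b_enc, b_dec = np.zeros(m), np.zeros(d)

x_hat, z = topk_sae_reconstruct(x, W_enc, b_enc, W_dec, b_dec, k)
print(int((z != 0).sum()))                      # at most k active semantic atoms
```

A dual-budget variant would simply call this with a small k for Key vectors and a larger k for Value vectors, matching the asymmetry described above.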