State Rank Dynamics in Linear Attention LLMs

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
The internal dynamics of state matrices in linear attention–based large language models remain poorly understood, hindering deeper insight into model behavior and optimization. Integrating spectral analysis, runtime state-rank tracking, and diagnostic probing, this work systematically characterizes the hierarchical evolution of state rank during inference. It reveals for the first time that linear attention heads inherently exhibit distinct low-rank and high-rank structures with functional specialization: low-rank heads predominantly drive reasoning, while high-rank heads are largely redundant. Building on this finding, the authors introduce a joint rank–norm pruning strategy that reduces KV-cache overhead by 38.9% in a zero-shot setting with minimal impact on model accuracy.
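The summary's "runtime state-rank tracking" presumably means measuring the effective rank of each head's state matrix as tokens arrive. The paper's exact rank measure is not given here; the sketch below uses the common entropy-based effective rank (the exponential of the entropy of the normalized singular-value spectrum), which is one plausible choice, not necessarily the authors' definition:

```python
import numpy as np

def effective_rank(S: np.ndarray, eps: float = 1e-12) -> float:
    """Effective rank of a state matrix, defined here as exp(H) where
    H is the Shannon entropy of the normalized singular-value
    distribution. This is an assumed measure, not the paper's exact one."""
    sv = np.linalg.svd(S, compute_uv=False)
    p = sv / (sv.sum() + eps)      # normalize the spectrum to a distribution
    p = p[p > eps]                 # drop numerically-zero modes
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))

# A near-rank-1 state scores close to 1; an i.i.d. Gaussian state
# scores close to full rank.
rng = np.random.default_rng(0)
low = np.outer(rng.normal(size=64), rng.normal(size=64))   # rank-1 state
high = rng.normal(size=(64, 64))                           # high-rank state
print(effective_rank(low), effective_rank(high))
```

Tracking this quantity per head over decoding steps would reproduce the kind of low-rank/high-rank bifurcation the summary describes.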

📝 Abstract
Linear Attention Large Language Models (LLMs) offer a compelling recurrent formulation that compresses context into a fixed-size state matrix, enabling constant-time inference. However, the internal dynamics of this compressed state remain largely opaque. In this work, we present a comprehensive study on the runtime state dynamics of state-of-the-art Linear Attention models. We uncover a fundamental phenomenon termed State Rank Stratification, characterized by a distinct spectral bifurcation among linear attention heads: while one group maintains an effective rank oscillating near zero, the other exhibits rapid growth that converges to an upper bound. Extensive experiments across diverse inference contexts reveal that these dynamics remain strikingly consistent, indicating that the identity of a head, whether low-rank or high-rank, is an intrinsic structural property acquired during pre-training, rather than a transient state dependent on the input data. Furthermore, our diagnostic probes reveal a surprising functional divergence: low-rank heads are indispensable for model reasoning, whereas high-rank heads exhibit significant redundancy. Leveraging this insight, we propose Joint Rank-Norm Pruning, a zero-shot strategy that achieves a 38.9% reduction in KV-cache overhead while largely maintaining model accuracy.
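The abstract's "recurrent formulation that compresses context into a fixed-size state matrix" can be made concrete with the generic (unnormalized-kernel) linear attention recurrence: the state accumulates outer products of values and feature-mapped keys, so per-token cost and memory are constant in sequence length. This is the standard textbook recurrence, not necessarily the exact variant used by the models the paper studies; the softplus feature map below is an illustrative stand-in for whatever positive feature map a given model uses:

```python
import numpy as np

def linear_attention_step(S, z, k, v):
    """One recurrent step: S_t = S_{t-1} + v_t k_t^T, z_t = z_{t-1} + k_t.
    The state S has fixed shape (d_v, d_k) regardless of context length."""
    return S + np.outer(v, k), z + k

def attend(S, z, q, eps=1e-6):
    """Read-out: o_t = S_t q_t / (z_t . q_t)."""
    return S @ q / (z @ q + eps)

d_k, d_v, T = 16, 16, 128
S, z = np.zeros((d_v, d_k)), np.zeros(d_k)
rng = np.random.default_rng(0)
for _ in range(T):
    # softplus keeps keys/queries positive (an assumed feature map)
    k = np.log1p(np.exp(rng.normal(size=d_k)))
    q = np.log1p(np.exp(rng.normal(size=d_k)))
    v = rng.normal(size=d_v)
    S, z = linear_attention_step(S, z, k, v)
    o = attend(S, z, q)
print(S.shape)  # state stays (16, 16) no matter how long the context grows
```

It is this fixed-size matrix S, tracked per head during inference, whose effective rank the paper reports as bifurcating into low-rank and high-rank regimes.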
Problem

Research questions and friction points this paper is trying to address.

Linear Attention
State Rank
Rank Stratification
LLMs
KV-cache
Innovation

Methods, ideas, or system contributions that make the work stand out.

State Rank Stratification
Linear Attention
KV-cache pruning
Zero-shot compression
Attention head redundancy
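The tags above (KV-cache pruning, zero-shot compression, head redundancy) suggest a head-selection procedure driven jointly by state rank and state norm. The paper's actual scoring rule is not reproduced here; the following is a hypothetical sketch in which each head is scored by effective rank times Frobenius norm and the lowest-scoring heads are kept, consistent with the finding that low-rank heads are the indispensable ones:

```python
import numpy as np

def effective_rank(S, eps=1e-12):
    """Entropy-based effective rank of a state matrix (assumed measure)."""
    sv = np.linalg.svd(S, compute_uv=False)
    p = sv / (sv.sum() + eps)
    p = p[p > eps]
    return float(np.exp(-(p * np.log(p)).sum()))

def joint_rank_norm_prune(states, keep_ratio=0.6):
    """Hypothetical joint rank-norm pruning: score each head's state by
    effective_rank * Frobenius norm and keep the lowest-scoring fraction.
    The exact joint score in the paper may differ."""
    scores = [(effective_rank(S) * np.linalg.norm(S), h)
              for h, S in enumerate(states)]
    scores.sort()                                  # low score -> keep
    n_keep = max(1, int(len(states) * keep_ratio))
    return sorted(h for _, h in scores[:n_keep])

rng = np.random.default_rng(1)
# 8 toy heads: five near-rank-1 states, three high-rank Gaussian states
states = [np.outer(rng.normal(size=32), rng.normal(size=32)) for _ in range(5)]
states += [rng.normal(size=(32, 32)) for _ in range(3)]
print(joint_rank_norm_prune(states, keep_ratio=0.625))
```

On this toy input the five low-rank heads are retained and the three high-rank (redundant, per the paper's probing) heads are pruned, shrinking the recurrent state kept per layer, which is the analogue of KV-cache overhead in linear attention models.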