Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models

📅 2026-02-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF

career value

213K/year
🤖 AI Summary
This work challenges the prevailing view that the highly anisotropic internal representations of large language models—characterized by a few dimensions exhibiting substantially higher activation than others—are inherently defective. Instead, it reframes these high-activation dimensions as interpretable semantic functional units. The authors introduce a novel, training-free magnitude-thresholding method to identify domain-critical dimensions and propose a “Critical Dimension Steering” paradigm that achieves precise semantic control by intervening exclusively on these key dimensions. Experimental results demonstrate that this approach significantly outperforms full-dimensional manipulation in both domain adaptation and jailbreak defense tasks, offering superior efficiency and interpretability.
📝 Abstract
Large Language Models (LLMs) exhibit highly anisotropic internal representations, often characterized by massive activations, a phenomenon where a small subset of feature dimensions possesses magnitudes significantly larger than the rest. While prior works view these extreme dimensions primarily as artifacts to be managed, we propose a distinct perspective: these dimensions serve as intrinsic interpretable functional units arising from domain specialization. Specifically, we propose a simple magnitude-based criterion to identify Domain-Critical Dimensions in a training-free manner. Our analyses reveal that such dimensions behave as interpretable semantic detectors for symbolic/quantitative patterns or domain-specific terms. In addition, we introduce Critical Dimension Steering, which applies activation steering exclusively to the identified dimensions. Empirical results show that this approach outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.
Problem

Research questions and friction points this paper is trying to address.

anisotropy
massive activations
interpretable control
large language models
domain adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

anisotropy
massive activations
interpretable control
domain-critical dimensions
activation steering