The Cylindrical Representation Hypothesis for Language Model Steering

📅 2026-05-03

🤖 AI Summary
This work addresses the unstable, hard-to-predict effects often observed when steering large language models, which existing theories built on the Linear Representation Hypothesis (LRH) and its orthogonality assumption fail to explain. The authors propose the Cylindrical Representation Hypothesis (CRH), which relaxes the orthogonality assumption while keeping representations linear, and posits that a concept is organized around a central axis surrounded by a normal plane whose sectors are either sensitive or resistant to steering. The normal plane can be reliably identified from concept difference vectors, but the sensitive sector cannot, and this sector-level uncertainty explains why steering outcomes fluctuate even along well-aligned directions. Experiments support the existence of the cylindrical structure and show that CRH offers a practical way to interpret steering behavior in real settings.
📝 Abstract
Steering is a widely used technique for controlling large language models, yet its effects are often unstable and hard to predict. Existing theoretical accounts are largely based on the Linear Representation Hypothesis (LRH). While LRH assumes that concepts can be orthogonalized for lossless control, this idealized mapping fails in real representations and cannot account for the observed unpredictability of steering. By relaxing LRH's orthogonality assumption while preserving linear representations, we show that overlapping concept contributions naturally yield a sample-specific axis-orthogonal structure. We formalize this as the Cylindrical Representation Hypothesis (CRH). In CRH, a central axis captures the main difference between concept absence and presence and drives concept generation. A surrounding normal plane controls steering sensitivity by determining how easily the axis can activate the target concept. Within this plane, only specific sensitive sectors strongly facilitate concept activation, while other sectors can suppress or delay it. While the surrounding normal plane can be reliably identified from difference vectors, the sensitive sector cannot, introducing intrinsic uncertainty at the sector level. This uncertainty provides a principled explanation for why steering outcomes often fluctuate even when using well-aligned directions. Our experiments verify the existence of the cylindrical structure and demonstrate that CRH provides a valid and practical way to interpret model steering behavior in real settings: https://github.com/mbzuai-nlp/CRH.
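The axis and normal-plane split described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation (see the linked repository for that). It assumes, for illustration only, that the central axis is estimated as the normalized mean of the per-sample difference vectors (concept present minus concept absent), and every name in it (`central_axis`, `split_difference`, `steer`) is hypothetical. It makes no attempt to locate the sensitive sector, which the abstract says cannot be reliably identified from difference vectors.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def central_axis(absent, present):
    """Unit vector along the mean (present - absent) difference.

    Assumption: the mean difference stands in for the central axis.
    """
    dim = len(absent[0])
    mean_diff = [
        sum(p[i] - a[i] for a, p in zip(absent, present)) / len(absent)
        for i in range(dim)
    ]
    norm = math.sqrt(dot(mean_diff, mean_diff))
    return [x / norm for x in mean_diff]


def split_difference(a, p, axis):
    """Split one sample's difference vector into an axial coefficient
    and a residual lying in the plane normal to the axis."""
    diff = [pi - ai for ai, pi in zip(a, p)]
    coeff = dot(diff, axis)
    normal = [d - coeff * u for d, u in zip(diff, axis)]
    return coeff, normal


def steer(h, axis, alpha):
    """Additive steering: move a hidden state by alpha along the axis."""
    return [x + alpha * u for x, u in zip(h, axis)]


# Synthetic stand-ins for hidden states (not real model activations):
# the "concept" shifts coordinate 0 by 3.0, plus small per-sample noise.
rng = random.Random(0)
dim, n = 8, 16
absent = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
present = [
    [x + (3.0 if i == 0 else 0.0) + rng.gauss(0, 0.1) for i, x in enumerate(row)]
    for row in absent
]

axis = central_axis(absent, present)
coeff, normal = split_difference(absent[0], present[0], axis)
steered = steer(absent[0], axis, 2.0)
```

In this toy setting the residual `normal` is sample-specific: each sample keeps its own component in the plane orthogonal to the shared axis, which is the part of the representation that, under CRH, governs how readily steering along the axis activates the concept.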
Problem

Research questions and friction points this paper is trying to address.

language model steering
representation geometry
concept activation
steering instability
orthogonality assumption
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cylindrical Representation Hypothesis
model steering
representation geometry
concept activation
steering uncertainty
Lang Gao
MBZUAI
Mechanistic Interpretability, Natural Language Processing
Jinghui Zhang
MBZUAI, Prev. Shandong University
NLP
Wei Liu
National University of Singapore
machine learning
Fengxian Ji
Northeast University
Agents, Machine Learning, CV
Chenxi Wang
Department of Natural Language Processing, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Zirui Song
PhD student at MBZUAI
NLP
Akash Ghosh
Department of Natural Language Processing, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Youssef Mohamed
Department of Natural Language Processing, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Preslav Nakov
Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)
Computational Linguistics, Large Language Models, Fact-checking, Fake News
Xiuying Chen
MBZUAI
Trustworthy NLP, Human-Centered NLP, Computational Social Science