There Is More to Refusal in Large Language Models than a Single Direction

📅 2026-02-02
🤖 AI Summary
This study tests whether refusal in large language models is governed by a single activation-space direction. Using geometric analysis of activation space and linear interventions, the authors examine the representations underlying eleven categories of refusal and non-compliance. These behaviors correspond to multiple geometrically distinct directions. Nevertheless, steering along any one of them produces a nearly identical trade-off between refusal and over-refusal, so each direction acts as a shared one-dimensional control knob. The different directions mainly change how the model refuses rather than whether it refuses.

📝 Abstract
Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, acting as a shared one-dimensional control knob. The primary effect of different directions is not whether the model refuses, but how it refuses.
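The vocabulary in the abstract (a direction in activation space, steering along it, ablating it, and comparing directions geometrically) can be made concrete with a minimal sketch. It assumes a difference-of-means estimate of each direction, which is a common choice in steering work and not confirmed here as the authors' method. The activations are synthetic random vectors, and every function name is hypothetical.

```python
import numpy as np

def mean_difference_direction(refused, complied):
    """Unit vector pointing from the mean 'complied' activation to the mean 'refused' one."""
    d = refused.mean(axis=0) - complied.mean(axis=0)
    return d / np.linalg.norm(d)

def cosine(u, v):
    """Cosine similarity between two direction vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def steer(h, direction, alpha):
    """Linear steering: shift every activation by alpha along the unit direction."""
    return h + alpha * direction

def ablate(h, direction):
    """Directional ablation: project out the component along the unit direction."""
    return h - np.outer(h @ direction, direction)

# Synthetic stand-ins for hidden states (assumption: no real model is involved).
rng = np.random.default_rng(0)
dim = 64
complied = rng.normal(size=(200, dim))

# Two hypothetical refusal categories, each shifted along a different axis.
offset_a = np.zeros(dim)
offset_a[0] = 3.0
offset_b = np.zeros(dim)
offset_b[1] = 3.0
refused_a = rng.normal(size=(200, dim)) + offset_a
refused_b = rng.normal(size=(200, dim)) + offset_b

dir_a = mean_difference_direction(refused_a, complied)
dir_b = mean_difference_direction(refused_b, complied)

# A low cosine similarity means the two category directions are geometrically distinct.
similarity = cosine(dir_a, dir_b)

h = rng.normal(size=(5, dim))
steered = steer(h, dir_a, alpha=4.0)   # push activations toward category A
ablated = ablate(h, dir_a)             # remove the category-A component
```

In this toy setup the two directions are nearly orthogonal by construction. The abstract's claim concerns what happens downstream in a real model: steering along either direction would shift the refusal versus over-refusal balance in much the same way, which a synthetic example like this cannot show.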
Problem

Research questions and friction points this paper is trying to address.

refusal
large language models
activation space
steering
non-compliance
Innovation

Methods, ideas, or system contributions that make the work stand out.

refusal directions
activation space
linear steering
large language models
behavioral diversity
Faaiz Joad
Qatar Computing Research Institute, HBKU, Doha, Qatar
Majd Hawasly
QCRI, Hamad Bin Khalifa University
Autonomous systems, Lifelong learning, Natural Language Processing
Sabri Boughorbel
Qatar Computing Research Institute, HBKU, Doha, Qatar
Nadir Durrani
Senior Scientist, QCRI, HBKU
Machine Translation, Interpretability, Transliteration, Word Segmentation, Natural Language Processing
H. Sencar
Qatar Computing Research Institute, HBKU, Doha, Qatar