SOM Directions are Better than One: Multi-Directional Refusal Suppression in Language Models

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing work on safety-aligned language models typically treats refusal behavior as a single direction in latent space, yet theoretical and empirical evidence suggests that refusal concepts are more likely distributed across a low-dimensional manifold. Method: This paper is the first to apply self-organizing maps (SOMs) to refusal analysis, automatically learning multiple refusal directions from the hidden states of harmful and harmless prompts. It further validates the non-unidirectional nature of refusal via prompt-representation clustering, differential centroid computation, and multi-directional ablation of internal activations. Contribution/Results: Experiments demonstrate that multi-directional ablation substantially outperforms unidirectional baselines and state-of-the-art jailbreaking methods, effectively suppressing refusal while preserving generation quality. This yields improved model controllability and higher-precision safety interventions, revealing the intrinsic manifold structure underlying refusal behavior.

📝 Abstract
Refusal refers to the functional behavior enabling safety-aligned language models to reject harmful or unethical prompts. Following the growing scientific interest in mechanistic interpretability, recent work encoded refusal behavior as a single direction in the model's latent space; e.g., computed as the difference between the centroids of harmful and harmless prompt representations. However, emerging evidence suggests that concepts in LLMs often appear to be encoded as a low-dimensional manifold embedded in the high-dimensional latent space. Motivated by these findings, we propose a novel method leveraging Self-Organizing Maps (SOMs) to extract multiple refusal directions. To this end, we first prove that SOMs generalize the prior work's difference-in-means technique. We then train SOMs on harmful prompt representations to identify multiple neurons. By subtracting the centroid of harmless representations from each neuron, we derive a set of multiple directions expressing the refusal concept. We validate our method in an extensive experimental setup, demonstrating that ablating multiple directions from models' internals outperforms not only the single-direction baseline but also specialized jailbreak algorithms, leading to an effective suppression of refusal. Finally, we conclude by analyzing the mechanistic implications of our approach.
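The direction-extraction step described above can be sketched as follows. This is an illustrative toy, not the paper's implementation: the SOM grid size, neighborhood schedule, synthetic "hidden states", and the choice of four units are all assumptions made for the example.

```python
import numpy as np

def train_som(data, n_units=4, lr=0.5, sigma=1.0, epochs=200, seed=0):
    """Minimal 1-D self-organizing map (illustrative only; the paper's
    exact SOM configuration is not reproduced here)."""
    rng = np.random.default_rng(seed)
    # initialize units from randomly chosen data points
    units = data[rng.choice(len(data), n_units, replace=False)].astype(float)
    for t in range(epochs):
        lr_t = lr * (1 - t / epochs)  # linearly decaying learning rate
        for x in data[rng.permutation(len(data))]:
            # best-matching unit: closest unit to this sample
            bmu = int(np.argmin(np.linalg.norm(units - x, axis=1)))
            # Gaussian neighborhood on a 1-D grid of units
            grid_dist = np.abs(np.arange(n_units) - bmu)
            h = np.exp(-grid_dist**2 / (2 * sigma**2))
            units += lr_t * h[:, None] * (x - units)
    return units

# toy stand-in for hidden states: harmful prompts clustered around
# several modes, harmless prompts around the origin (hypothetical data)
rng = np.random.default_rng(1)
harmful = np.vstack([rng.normal(m, 0.1, size=(50, 8)) for m in (1.0, -1.0, 2.0)])
harmless = rng.normal(0.0, 0.1, size=(100, 8))

units = train_som(harmful, n_units=4)
# one refusal direction per SOM unit: unit minus the harmless centroid,
# mirroring the paper's generalization of difference-in-means
dirs = units - harmless.mean(axis=0)
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
```

With a single unit, the learned direction collapses toward the harmful-minus-harmless difference-in-means; multiple units tile the harmful cluster structure and yield a set of directions instead of one.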
Problem

Research questions and friction points this paper is trying to address.

Suppressing refusal behavior in safety-aligned language models
Extracting multi-directional refusal representations using Self-Organizing Maps
Improving refusal suppression beyond single-direction ablation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Self-Organizing Maps to extract multiple refusal directions
Identifies refusal neurons from harmful prompt representations
Ablates multiple directions to suppress refusal behavior
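The multi-directional ablation named in the last bullet can be sketched as projecting a hidden state onto the orthogonal complement of the span of the learned directions. A minimal sketch with random stand-in vectors (the real method operates on transformer activations at chosen layers, which are not reproduced here):

```python
import numpy as np

def ablate(h, directions):
    """Remove h's component along every refusal direction at once.
    QR-orthonormalizing the direction set ensures the full span is
    removed even when the directions are not mutually orthogonal."""
    Q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)
    return h - Q @ (Q.T @ h)

# toy check with random unit-norm "refusal directions" (hypothetical)
rng = np.random.default_rng(0)
dirs = rng.normal(size=(3, 16))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
h = rng.normal(size=16)
h_abl = ablate(h, dirs)
```

Projecting onto the complement of the whole span, rather than subtracting each direction sequentially, is the natural multi-directional analogue of the single-direction ablation used in prior work.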