Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control

📅 2025-04-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the intrinsic censorship mechanisms in large language models (LLMs) for safety alignment, characterizing them along two complementary dimensions: the overt “refuse–comply” response tendency and the latent “thought suppression”—a self-regulatory constraint operating within the model’s reasoning process. Method: We propose a representation-engineering-based directional vector probing framework to identify and quantify two orthogonal control vectors governing censorship behavior in open-source safety-finetuned models (e.g., DeepSeek-R1). Building on this, we introduce a negative-vector projection intervention paradigm enabling decomposable and reversible modulation of censorship. Results: Experiments demonstrate that our method significantly reduces refusal rates while restoring chain-of-thought reasoning capabilities—empirically validating the linear representability and controllability of censorship mechanisms in activation space. This work advances interpretability and controllable alignment of LLM safety mechanisms.

📝 Abstract
Large language models (LLMs) have transformed the way we access information. These models are often tuned to refuse to comply with requests that are considered harmful and to produce responses that better align with the preferences of those who control the models. To understand how this "censorship" works, we use representation engineering techniques to study open-weights safety-tuned models. We present a method for finding a refusal–compliance vector that detects and controls the level of censorship in model outputs. We also analyze recent reasoning LLMs, distilled from DeepSeek-R1, and uncover an additional dimension of censorship through "thought suppression". We show a similar approach can be used to find a vector that suppresses the model's reasoning process, allowing us to remove censorship by applying the negative multiples of this vector.
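The core mechanics described in the abstract — extracting a steering direction as a difference of mean activations between two behavioral classes, then adding negative multiples of it to hidden states — can be sketched on synthetic data. This is a minimal illustration of the general representation-engineering recipe, not the paper's implementation; the dimensions, data, and function names below are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hypothetical hidden-state dimension

# Stand-ins for cached hidden states at one layer: activations on
# prompts the model refuses vs. prompts it complies with. In the real
# setting these would come from a forward pass over contrastive prompts.
refuse_dir = rng.normal(size=d)
refuse_acts = rng.normal(size=(100, d)) + 3.0 * refuse_dir
comply_acts = rng.normal(size=(100, d))

# Difference-of-means "refusal–compliance" direction, unit-normalized.
v = refuse_acts.mean(axis=0) - comply_acts.mean(axis=0)
v /= np.linalg.norm(v)

def projection(hidden, vec):
    """Scalar readout along the vector: detects the censorship level."""
    return hidden @ vec

def steer(hidden, vec, alpha):
    """Shift a hidden state along the vector; negative alpha pushes
    the activation away from the refusal direction."""
    return hidden + alpha * vec

h = refuse_acts[0]
before = projection(h, v)            # large and positive: "refusing"
after = projection(steer(h, v, -before), v)  # ~0 after removing the component
```

In an actual model the `steer` step would be applied inside the forward pass (e.g. via a hook on a chosen layer), and the same recipe would be repeated with reasoning-suppressed vs. free-reasoning prompts to obtain the thought-suppression vector.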
Problem

Research questions and friction points this paper is trying to address.

Understanding how censorship works in LLMs
Finding refusal-compliance vectors for censorship control
Uncovering thought suppression in reasoning LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Representation engineering for censorship control
Refusal-compliance vector detects output censorship
Thought suppression vector removes reasoning censorship