🤖 AI Summary
This paper addresses failure modes that arise when fine-tuning large language models (LLMs) on limited, narrow-distribution data, such as sycophancy, under-refusals, and behavioral drift. It proposes Contrastive Weight Steering (CWS), a post-training model editing method that isolates a behavioral direction in weight space by taking the difference between the weight deltas of two small contrastive fine-tunes, one inducing the desired behavior and one inducing its opposite, and then adds or subtracts this direction from the model's weights. Because the edit is pure weight arithmetic, it requires no activation access at inference time and can exert behavioral control beyond the fine-tuning distribution. Experiments show that CWS suppresses sycophantic responses, mitigates behavioral drift introduced by task-specific fine-tuning while preserving task performance gains, and that the similarity between fine-tuning updates and an "evil" weight direction provides a signal for monitoring alignment during training and detecting emergent misalignment.
📝 Abstract
Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight space by subtracting the weight deltas of two small fine-tunes -- one that induces the desired behavior and another that induces its opposite -- and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.
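The weight arithmetic the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dict-of-arrays model representation, the function names, and the scale `alpha` are assumptions introduced here for clarity; a real model would use a framework state dict, and the paper may scale or select layers differently.

```python
import numpy as np

def weight_delta(finetuned, base):
    """Per-parameter difference between a fine-tuned model and its base."""
    return {k: finetuned[k] - base[k] for k in base}

def contrastive_direction(base, ft_pos, ft_neg):
    """Behavior direction in weight space: the delta of the fine-tune that
    induces the behavior minus the delta of the fine-tune that induces
    its opposite (hypothetical form of the method's contrast step)."""
    d_pos = weight_delta(ft_pos, base)
    d_neg = weight_delta(ft_neg, base)
    return {k: d_pos[k] - d_neg[k] for k in base}

def steer(weights, direction, alpha):
    """Edit the weights: alpha > 0 adds the behavior, alpha < 0 removes it."""
    return {k: weights[k] + alpha * direction[k] for k in weights}

def direction_similarity(update, direction):
    """Cosine similarity between a fine-tuning update and a reference
    (e.g. 'evil') weight direction, flattened across all parameters --
    the kind of monitoring signal the abstract's last sentence suggests."""
    u = np.concatenate([update[k].ravel() for k in sorted(update)])
    d = np.concatenate([direction[k].ravel() for k in sorted(direction)])
    return float(u @ d / (np.linalg.norm(u) * np.linalg.norm(d)))
```

A toy usage: with a base of zeros, a "behavior" fine-tune of ones, and an "opposite" fine-tune of minus ones, the contrastive direction is twice the ones vector, and steering with a negative `alpha` pushes the weights toward the opposite behavior.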