🤖 AI Summary
This paper addresses failure modes that arise when fine-tuning large language models (LLMs) on limited, narrow-distribution data, such as sycophancy, under-refusals, and behavioral drift. It proposes Contrastive Weight Steering (CWS), a post-training model editing method that isolates a behavioral direction in weight space by taking the difference between the weight deltas of two small contrastive fine-tunes, one inducing the desired behavior and one inducing its opposite, and then adds or subtracts this direction from the model's weights. Because the edit is pure weight arithmetic, it requires no activation access at inference time and can exert behavioral control beyond the fine-tuning distribution. Experiments show that CWS suppresses sycophantic responses, mitigates behavioral drift introduced by task-specific fine-tuning while preserving task performance gains, and that the similarity between fine-tuning updates and an "evil" weight direction provides a signal for monitoring alignment during training and detecting emergent misalignment.
📝 Abstract
Providing high-quality feedback to Large Language Models (LLMs) on a diverse training distribution can be difficult and expensive, and providing feedback only on a narrow distribution can result in unintended generalizations. To better leverage narrow training data, we propose contrastive weight steering, a simple post-training method that edits the model parameters using weight arithmetic. We isolate a behavior direction in weight space by subtracting the weight deltas of two small fine-tunes -- one that induces the desired behavior and another that induces its opposite -- and then add or remove this direction to modify the model's weights. We apply this technique to mitigate sycophancy and induce misalignment, and find that weight steering often generalizes further than activation steering, achieving stronger out-of-distribution behavioral control before degrading general capabilities. We also show that, in the context of task-specific fine-tuning, weight steering can partially mitigate undesired behavioral drift: it can reduce sycophancy and under-refusals introduced during fine-tuning while preserving task performance gains. Finally, we provide preliminary evidence that emergent misalignment can be detected by measuring the similarity between fine-tuning updates and an "evil" weight direction, suggesting that it may be possible to monitor the evolution of weights during training and detect rare misaligned behaviors that never manifest during training or evaluations.
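The weight arithmetic the abstract describes can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the dict-of-arrays model representation, the function names, and the scale `alpha` are assumptions introduced here for clarity; a real model would use a framework state dict, and the paper may scale or select layers differently.

```python
import numpy as np

def weight_delta(finetuned, base):
    """Per-parameter difference between a fine-tuned model and its base."""
    return {k: finetuned[k] - base[k] for k in base}

def contrastive_direction(base, ft_pos, ft_neg):
    """Behavior direction in weight space: the delta of the fine-tune that
    induces the behavior minus the delta of the fine-tune that induces
    its opposite (hypothetical form of the method's contrast step)."""
    d_pos = weight_delta(ft_pos, base)
    d_neg = weight_delta(ft_neg, base)
    return {k: d_pos[k] - d_neg[k] for k in base}

def steer(weights, direction, alpha):
    """Edit the weights: alpha > 0 adds the behavior, alpha < 0 removes it."""
    return {k: weights[k] + alpha * direction[k] for k in weights}

def direction_similarity(update, direction):
    """Cosine similarity between a fine-tuning update and a reference
    (e.g. 'evil') weight direction, flattened across all parameters --
    the kind of monitoring signal the abstract's last sentence suggests."""
    u = np.concatenate([update[k].ravel() for k in sorted(update)])
    d = np.concatenate([direction[k].ravel() for k in sorted(direction)])
    return float(u @ d / (np.linalg.norm(u) * np.linalg.norm(d)))
```

A toy usage: with a base of zeros, a "behavior" fine-tune of ones, and an "opposite" fine-tune of minus ones, the contrastive direction is twice the ones vector, and steering with a negative `alpha` pushes the weights toward the opposite behavior.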