🤖 AI Summary
Current large language models (LLMs) refuse harmful requests inconsistently and imprecisely across prompts. To address this, we propose affine concept editing (ACE), a method for steering model behavior by intervening directly in activations. Starting from an affine decomposition of model activation vectors, we show that prior steering methods correspond to subsets of this decomposition's terms, and we derive ACE as a combination of affine subspace projection and activation addition. ACE requires no fine-tuning or external classifiers. Evaluated with LLM-based scoring on harmful and harmless prompts across ten models, including Llama 3 70B, ACE controls refusal more precisely than existing directional intervention methods, generalizes across prompt types, and remains coherent on models where directional ablation alone produces incoherent outputs.
📝 Abstract
We propose affine concept editing (ACE) as an approach for steering language models' behavior by intervening directly in activations. We begin with an affine decomposition of model activation vectors and show that prior methods for steering model behavior correspond to subsets of terms of this decomposition. We then provide a derivation of ACE and use it to control refusal behavior on ten different models, including Llama 3 70B. ACE combines affine subspace projection and activation addition to reliably control the model's refusal responses across prompt types. We evaluate the results using LLM-based scoring on a collection of harmful and harmless prompts. Our experiments demonstrate that ACE consistently achieves more precise control over model behavior than existing methods and generalizes to models where directional ablation via affine subspace projection alone produces incoherent outputs. Code for reproducing our results is available at https://github.com/EleutherAI/steering-llama3.
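To make the "affine subspace projection plus activation addition" idea concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the paper's exact implementation: the refusal direction is estimated here as a difference of class means, and the function and variable names (`ace_edit`, `mu_harmless`, `mu_harmful`, `c`) are hypothetical.

```python
import numpy as np

def ace_edit(x, mu_harmless, mu_harmful, c=0.0):
    """Sketch of affine concept editing on a single activation vector.

    x           -- activation vector to edit
    mu_harmless -- mean activation on harmless prompts (reference point)
    mu_harmful  -- mean activation on harmful prompts
    c           -- target coordinate along the refusal direction
                   (c=0 removes the refusal component relative to the
                   harmless reference; c>0 adds refusal back)

    All names and the difference-of-means direction estimate are
    illustrative assumptions, not the authors' exact recipe.
    """
    # Estimate a refusal direction from the two class means.
    v = mu_harmful - mu_harmless
    v_hat = v / np.linalg.norm(v)

    # Affine projection: measure x's component along v_hat relative to
    # the reference point mu_harmless, not relative to the origin.
    coord = np.dot(x - mu_harmless, v_hat)

    # Remove that component, then add back the desired amount c
    # (activation addition along the same direction).
    return x - coord * v_hat + c * v_hat
```

The affine framing is the point of the sketch: projecting `x - mu_harmless` rather than `x` itself means the edit is taken relative to a reference activation, which is what distinguishes this from plain directional ablation through the origin.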