🤖 AI Summary
Problem: Prior work on representation steering has largely overlooked how steering core alignment objectives (bias, harmful generation, and hallucination) affects secondary behaviors such as sycophancy and commonsense morality, leaving many trade-offs unexamined. Method: We introduce SteeringControl, a benchmark for multi-dimensional alignment evaluation, and propose a modular steering framework. We systematically evaluate five popular steering methods across target behaviors and across two models, Qwen-2.5-7B and Llama-3.1-8B. Contribution/Results: Steering performance varies substantially with the specific combination of method, target behavior, and model, and poor combinations can induce severe concept entanglement. The benchmark and framework offer practical tools for studying interpretable, controllable alignment interventions.
📝 Abstract
We introduce SteeringControl, a benchmark for evaluating representation steering methods across core alignment objectives (bias, harmful generation, and hallucination) and their effects on secondary behaviors such as sycophancy and commonsense morality. While prior alignment work often highlights truthfulness or reasoning ability to demonstrate the side effects of representation steering, we find that many tradeoffs remain unexplored and have not been studied systematically. We collect a dataset of safety-relevant primary and secondary behaviors to evaluate steering effectiveness and behavioral entanglement for five popular steering methods. To enable this, we build a modular steering framework from unique components that serve as the building blocks of many existing methods. Our results on Qwen-2.5-7B and Llama-3.1-8B show that strong steering performance depends on the specific combination of steering method, model, and targeted behavior, and that poor combinations of these three can cause severe concept entanglement. We release our code at https://github.com/wang-research-lab/SteeringControl.git.
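For readers new to the term, representation steering generally means editing a model's internal activations at inference time to strengthen or suppress a behavior. The sketch below illustrates one widely used variant, difference-of-means activation addition, on toy vectors. It is an assumption-laden illustration only: the abstract does not name the five methods evaluated, and the function names, the strength parameter alpha, and all numbers here are hypothetical rather than taken from the SteeringControl code.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equal-length activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


def steering_direction(pos: Sequence[Sequence[float]],
                       neg: Sequence[Sequence[float]]) -> Vector:
    """Difference of means: mean(pos activations) - mean(neg activations)."""
    mean_pos, mean_neg = mean_vector(pos), mean_vector(neg)
    return [p - n for p, n in zip(mean_pos, mean_neg)]


def apply_steering(hidden: Sequence[float],
                   direction: Sequence[float],
                   alpha: float) -> Vector:
    """Shift a hidden state along the direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy 3-dimensional "activations" (made-up numbers, not real model data).
with_behavior = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
without_behavior = [[0.0, 0.0, 2.0], [0.0, 0.0, 2.0]]

direction = steering_direction(with_behavior, without_behavior)

# A negative alpha pushes the hidden state away from the behavior.
steered = apply_steering([1.0, 1.0, 1.0], direction, alpha=-0.5)
```

In this toy setting only the first coordinate moves. Entanglement, in the sense the abstract describes, would correspond to a steering direction that also shifts coordinates tied to behaviors that were not targeted.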