SteeringControl: Holistic Evaluation of Alignment Steering in LLMs

📅 2025-09-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Prior work overlooks the coupling between core alignment objectives (bias mitigation, harm reduction, and hallucination suppression) and secondary behavioral traits (such as sycophancy and commonsense morality), leaving trade-offs in representation steering unexamined. Method: SteeringControl, a dedicated benchmark for multi-dimensional alignment evaluation, together with a modular steering framework; the authors conduct systematic cross-target and cross-model assessments of five mainstream steering methods on Qwen-2.5-7B and Llama-3.1-8B. Contribution/Results: the study reveals substantial performance variance across methods, targets, and models; demonstrates that poorly chosen steering combinations induce concept entanglement; and establishes that core alignment objectives exhibit pervasive, previously underappreciated trade-offs. These findings provide both conceptual grounding and practical tools for interpretable, controllable alignment interventions.

📝 Abstract
We introduce SteeringControl, a benchmark for evaluating representation steering methods across core alignment objectives (bias, harmful generation, and hallucination) and their effects on secondary behaviors such as sycophancy and commonsense morality. While prior alignment work often highlights truthfulness or reasoning ability to demonstrate the side effects of representation steering, we find there are many unexplored tradeoffs not yet understood in a systematic way. We collect a dataset of safety-relevant primary and secondary behaviors to evaluate steering effectiveness and behavioral entanglement centered around five popular steering methods. To enable this, we craft a modular steering framework based on unique components that serve as the building blocks of many existing methods. Our results on Qwen-2.5-7B and Llama-3.1-8B find that strong steering performance depends on the specific combination of steering method, model, and targeted behavior, and that severe concept entanglement can result from poor combinations of the three. We release our code here: https://github.com/wang-research-lab/SteeringControl.git.
Problem

Research questions and friction points this paper is trying to address.

Evaluating alignment steering effects on bias, harm, and hallucination
Assessing secondary behavior impacts like sycophancy and morality
Analyzing performance dependencies across methods, models, and behaviors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for evaluating representation steering methods
Modular steering framework with unique components
Analysis of steering performance and concept entanglement
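The paper's actual framework lives in the linked repository; as a minimal, hypothetical sketch of one common building block shared by several of the steering methods it evaluates (a difference-of-means steering direction, added to a layer's hidden states at inference time), consider the following. The function names, the toy activation data, and the scale `alpha` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def steering_vector(pos_acts: np.ndarray, neg_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means direction: mean hidden state on behavior-positive
    prompts minus mean on behavior-negative prompts, unit-normalized."""
    v = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    return v / np.linalg.norm(v)

def apply_steering(hidden: np.ndarray, v: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    """Add the scaled direction to every token's hidden state (broadcasts
    over the sequence dimension)."""
    return hidden + alpha * v

# Toy example: random "activations" with hidden size 8, 16 prompts per class.
rng = np.random.default_rng(0)
pos = rng.normal(0.5, 1.0, size=(16, 8))   # activations on prompts showing the behavior
neg = rng.normal(-0.5, 1.0, size=(16, 8))  # activations on prompts lacking it
v = steering_vector(pos, neg)
steered = apply_steering(rng.normal(size=(4, 8)), v, alpha=2.0)
```

In practice such a direction would be extracted from a chosen transformer layer (e.g. via a forward hook) and added during generation; varying `alpha` trades steering strength against the kind of concept entanglement the benchmark measures.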
Vincent Siu
University of California, Santa Cruz (UCSC)
Natural Language Processing
Nicholas Crispino
PhD Student, University of California, Santa Cruz
Natural Language Processing
David Park
Washington University in St. Louis
Nathan W. Henry
University of California, Berkeley
Zhun Wang
Graduate Student, UC Berkeley
Yang Liu
University of California, Santa Cruz
Dawn Song
Professor of Computer Science, UC Berkeley
Computer Security and Privacy
Chenguang Wang
University of California, Santa Cruz