SteeringControl: Holistic Evaluation of Alignment Steering in LLMs

📅 2025-09-16

🤖 AI Summary

Problem: Prior work overlooks the coupling between core alignment objectives (bias mitigation, harm reduction, and hallucination suppression) and secondary behavioral traits such as sycophancy and commonsense morality, leaving trade-offs in representation steering unexamined. Method: We introduce SteeringControl, a benchmark for multi-dimensional alignment evaluation, together with a modular steering framework, and systematically assess five popular steering methods across target behaviors and models on Qwen-2.5-7B and Llama-3.1-8B. Contribution/Results: Our empirical study shows that steering performance varies substantially with the combination of method, target behavior, and model; that poor combinations induce severe concept entanglement; and that core alignment objectives carry trade-offs that earlier work has left underexplored. These findings offer empirical grounding and practical tools for more interpretable, controllable alignment interventions.

📝 Abstract

We introduce SteeringControl, a benchmark for evaluating representation steering methods across core alignment objectives (bias, harmful generation, and hallucination) and their effects on secondary behaviors such as sycophancy and commonsense morality. While prior alignment work often highlights truthfulness or reasoning ability to demonstrate the side effects of representation steering, we find that many tradeoffs remain unexplored and are not yet understood in a systematic way. We collect a dataset of safety-relevant primary and secondary behaviors to evaluate steering effectiveness and behavioral entanglement, centered on five popular steering methods. To enable this, we craft a modular steering framework based on unique components that serve as the building blocks of many existing methods. Our results on Qwen-2.5-7B and Llama-3.1-8B show that strong steering performance depends on the specific combination of steering method, model, and targeted behavior, and that poor combinations of these three can also cause severe concept entanglement. We release our code here: https://github.com/wang-research-lab/SteeringControl.git.
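
For readers new to representation steering, the sketch below illustrates two commonly used building blocks: deriving a direction as a difference of mean activations, then either adding it to hidden states or projecting it out. It is a minimal NumPy illustration under stated assumptions, not the SteeringControl implementation: the function names are hypothetical, activations are assumed to be already extracted as 2-D arrays of shape (examples, hidden_size), and the abstract does not say which five methods the benchmark covers.

```python
import numpy as np


def difference_of_means_direction(pos_acts, neg_acts):
    """Unit vector pointing from the mean 'negative' activation to the mean 'positive' one.

    Both inputs are 2-D arrays of shape (num_examples, hidden_size).
    """
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    norm = np.linalg.norm(direction)
    if norm == 0:
        raise ValueError("positive and negative activations have identical means")
    return direction / norm


def add_steering(hidden, direction, alpha):
    """Activation addition: shift every hidden state by alpha along the direction."""
    return hidden + alpha * direction


def ablate_direction(hidden, direction):
    """Directional ablation: remove each hidden state's component along the direction.

    `hidden` is 2-D (num_tokens, hidden_size); `direction` must be unit length.
    """
    return hidden - np.outer(hidden @ direction, direction)
```

In a real model these operations would be applied to a chosen layer's activations during the forward pass; the choice of layer, the strength `alpha`, and the data used to estimate the direction are all design decisions that a benchmark of this kind would need to vary.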

Problem

Research questions and friction points this paper is trying to address.

Evaluating how representation steering affects bias, harmful generation, and hallucination

Assessing side effects on secondary behaviors such as sycophancy and commonsense morality

Analyzing how steering performance depends on the combination of method, model, and targeted behavior

Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark for evaluating representation steering methods

Modular steering framework built from components that underlie many existing methods

Analysis of steering performance and concept entanglement
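
The distinction between steering effectiveness and concept entanglement can be made concrete with a small sketch. The function below is one hypothetical way to summarize the two quantities from per-behavior scores measured before and after steering; it is not the benchmark's actual metric. The function name, the score dictionaries, and the convention that higher scores are better are all assumptions made for illustration.

```python
def steering_report(baseline, steered, target):
    """Summarize on-target gain and off-target drift after steering.

    baseline, steered: dicts mapping behavior name -> score (higher is better),
                       measured without and with steering, respectively.
    target:            name of the behavior the steering was meant to change.
    """
    if baseline.keys() != steered.keys():
        raise ValueError("baseline and steered must cover the same behaviors")
    if target not in baseline:
        raise KeyError(f"unknown target behavior: {target}")

    # Per-behavior change caused by steering.
    deltas = {name: steered[name] - baseline[name] for name in baseline}

    # Entanglement proxy: mean absolute change on behaviors that were not targeted.
    off_target = [abs(delta) for name, delta in deltas.items() if name != target]
    mean_shift = sum(off_target) / len(off_target) if off_target else 0.0

    return {
        "target_gain": deltas[target],
        "mean_off_target_shift": mean_shift,
        "deltas": deltas,
    }
```

A large `target_gain` paired with a small `mean_off_target_shift` would indicate effective, well-isolated steering, while a large off-target shift is the kind of entanglement the benchmark is designed to surface.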
