🤖 AI Summary
This work addresses the lack of interpretability in existing evaluations of activation intervention methods for large language models, which rely primarily on black-box outputs or external LLM judges and therefore struggle to reliably assess steering efficacy. To overcome this limitation, the paper proposes information-theoretic metrics derived from internal model activations, namely the Normalized Branching Factor (NBF) and KL divergence, as interpretable predictors of steering success. It further introduces a human-aligned proxy annotation benchmark with high inter-annotator consistency to make evaluation more objective. Through experiments spanning Contrastive Activation Addition (CAA), sparse autoencoder-based interventions, and cross-model consistency validation, the study demonstrates that NBF and KL divergence significantly predict steering success rates, thereby establishing a more reliable and interpretable evaluation baseline for current intervention techniques.
📝 Abstract
Activation-based steering enables Large Language Models (LLMs) to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood, as prior work has relied primarily on black-box outputs or LLM-based judges. In this study, we investigate whether the reliability of steering can be diagnosed using internal model signals. We focus on two information-theoretic measures: the entropy-derived Normalized Branching Factor (NBF), and the Kullback-Leibler (KL) divergence between steered activations and targeted concepts in the vocabulary space. We hypothesize that effective steering corresponds to structured entropy preservation and coherent KL alignment across decoding steps. Building on a reliability study demonstrating high inter-judge agreement between two architecturally distinct LLMs, we use LLM-generated annotations as ground truth and show that these mechanistic signals provide meaningful predictive power for identifying successful steering and estimating failure probability. We further introduce a stronger evaluation baseline for Contrastive Activation Addition (CAA) and Sparse Autoencoder-based steering, the two most widely adopted activation-steering methods.
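The two diagnostic signals above can be sketched from next-token logits. The paper's exact definitions (in particular the normalization used for NBF and the reference distribution for the KL term) are not given here, so the formulation below is a hypothetical but common one: the branching factor is the exponential of the Shannon entropy of the next-token distribution, normalized by vocabulary size, and the KL divergence compares the steered distribution against a reference (e.g. concept-aligned) distribution.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def normalized_branching_factor(logits: np.ndarray) -> float:
    """Entropy-derived branching factor exp(H), normalized by vocab size.

    Hypothetical formulation: exp(H) is the "effective number" of
    next-token choices; dividing by |V| maps it into (0, 1], where 1
    means a uniform next-token distribution.
    """
    p = softmax(logits)
    h = -np.sum(p * np.log(p + 1e-12))  # Shannon entropy in nats
    return float(np.exp(h) / len(p))

def kl_divergence(p_logits: np.ndarray, q_logits: np.ndarray) -> float:
    """KL(P || Q) between steered (P) and reference (Q) distributions."""
    p, q = softmax(p_logits), softmax(q_logits)
    return float(np.sum(p * np.log((p + 1e-12) / (q + 1e-12))))
```

Under this sketch, "structured entropy preservation" would correspond to NBF staying in a moderate band across decoding steps (neither collapsing to a single token nor flattening toward uniform), while "coherent KL alignment" would correspond to a consistently small divergence from the concept-aligned reference distribution.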