🤖 AI Summary
This work exposes critical reliability flaws in prevailing lightweight activation-guided intervention methods for language models—namely, DoLa, function vectors, and task vectors—when deployed across diverse model architectures. Addressing the narrow scope and insufficient robustness validation of prior studies, we conduct a systematic evaluation across 36 large language models spanning 14 model families and parameter counts from 1.5B to 70B. Results reveal severe model dependency: performance gains are rare, with no improvement observed on most models and significant degradation occurring in nearly half. Causal analysis further invalidates the core assumptions underpinning these methods—namely, linear separability of activation subspaces and universality of task-aligned directions—in practical settings. To our knowledge, this is the first large-scale, cross-model benchmark for activation-guided interventions. It identifies fundamental reliability bottlenecks and provides empirical grounding for developing trustworthy, controllable model editing techniques.
📝 Abstract
Steering methods for language models (LMs) have gained traction as lightweight alternatives to fine-tuning, enabling targeted modifications to model activations. However, prior studies primarily report results on a few models, leaving critical gaps in understanding the robustness of these methods. In this work, we systematically examine three prominent steering methods -- DoLa, function vectors, and task vectors. In contrast to the original studies, which evaluated a handful of models, we test up to 36 models belonging to 14 families, with sizes ranging from 1.5B to 70B parameters. Our experiments reveal substantial variability in the effectiveness of the steering approaches, with a large number of models showing no improvement and, at times, degradation in steering performance. Our analysis demonstrates fundamental flaws in the assumptions underlying these methods, challenging their reliability as scalable steering solutions.
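To illustrate the kind of intervention the abstract refers to, here is a minimal, hypothetical sketch of activation steering: function-vector and task-vector methods add a precomputed direction to a layer's hidden activation, i.e. h' = h + α·v. The function name, toy vectors, and scale α below are illustrative assumptions, not the paper's implementation.

```python
def steer(hidden, vector, alpha=1.0):
    """Return the steered activation h' = h + alpha * v (toy sketch).

    `hidden` is a layer's hidden-state vector, `vector` a precomputed
    steering direction (e.g. a task or function vector); both are plain
    Python lists here for illustration.
    """
    assert len(hidden) == len(vector), "dimensions must match"
    return [h + alpha * v for h, v in zip(hidden, vector)]

# Toy example: a 4-dimensional activation nudged along a task direction.
h = [0.5, -1.0, 0.0, 2.0]   # hypothetical hidden activation
v = [1.0, 0.0, -1.0, 0.5]   # hypothetical steering direction
print(steer(h, v, alpha=0.5))  # → [1.0, -1.0, -0.5, 2.25]
```

In practice the same addition is applied inside the model (e.g. via a forward hook on a chosen layer), and the paper's point is that whether this helps or hurts varies sharply across model families.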