🤖 AI Summary
Existing intervention-based model steering methods are prone to overfitting and produce unnatural outputs, limiting their ability to faithfully reveal a model's internal mechanisms. This work proposes Concept DAS (CDAS), a data-driven approach that enables bidirectional steering by aligning intervened output distributions with counterfactual distributions through distributed interchange interventions (DII) and weakly supervised distribution matching. CDAS abandons the conventional objective of probability maximization, thereby reducing reliance on hyperparameters and improving control stability. The method supports causal variable localization, shows performance gains with increasing model scale on the AxBench benchmark, and successfully suppresses refusal behaviors and neutralizes chain-of-thought backdoors in safety-critical scenarios, all while preserving the model's general capabilities.
📝 Abstract
Intervention-based model steering offers a lightweight and interpretable alternative to prompting and fine-tuning. However, because they adapt strong optimization objectives from fine-tuning, current methods are susceptible to overfitting and often underperform, sometimes generating unnatural outputs. We hypothesize that this is because effective steering requires the faithful identification of internal model mechanisms, not the enforcement of external preferences. To this end, we build on the principles of distributed alignment search (DAS), the standard for causal variable localization, to propose a new steering method: Concept DAS (CDAS). While we adopt the core mechanism of DAS, the distributed interchange intervention (DII), we introduce a novel distribution matching objective tailored to the steering task: aligning intervened output distributions with counterfactual distributions. CDAS differs from prior work in two main ways. First, it learns interventions via weakly supervised distribution matching rather than probability maximization. Second, it uses DIIs that naturally enable bidirectional steering and allow steering factors to be derived from data, reducing the effort required for hyperparameter tuning and yielding more faithful and stable control. On AxBench, a large-scale model steering benchmark, we show that CDAS does not always outperform preference-optimization methods but may benefit more from increased model scale. In two safety-related case studies, overriding refusal behaviors of safety-aligned models and neutralizing a chain-of-thought backdoor, CDAS achieves systematic steering while maintaining general model utility. These results indicate that CDAS is complementary to preference-optimization approaches and, under the right conditions, constitutes a robust approach to intervention-based model steering. Our code is available at https://github.com/colored-dye/concept_das.
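The core operation the abstract describes, a distributed interchange intervention, can be pictured as swapping the component of a hidden state that lies in a learned subspace with the corresponding component from a counterfactual forward pass. The sketch below is illustrative only and is not the authors' implementation: the function name, shapes, and random values are all assumptions, and the subspace `R` would in practice be learned, not sampled.

```python
import numpy as np

def distributed_interchange_intervention(h_base, h_source, R):
    """Swap activations along a subspace between two forward passes.

    h_base, h_source: hidden states (d,) from the base and the
        counterfactual (source) run at the same layer and position.
    R: (k, d) matrix with orthonormal rows spanning the intervention
        subspace (learned in DAS; random here for illustration).

    Returns h_base with its component in span(R) replaced by the
    source run's component; the orthogonal complement is untouched.
    """
    return h_base - R.T @ (R @ h_base) + R.T @ (R @ h_source)

# Toy example with hypothetical dimensions: d=4 hidden dim, k=1 subspace.
rng = np.random.default_rng(0)
d, k = 4, 1
# Orthonormalize via QR so the rows of R form an orthonormal basis.
R = np.linalg.qr(rng.normal(size=(d, k)))[0].T  # shape (k, d)

h_base = rng.normal(size=d)
h_source = rng.normal(size=d)
h_new = distributed_interchange_intervention(h_base, h_source, R)

# Inside the subspace, the intervened state now matches the source run.
assert np.allclose(R @ h_new, R @ h_source)
# Outside the subspace, the base run's information is preserved.
assert np.allclose(h_new - R.T @ (R @ h_new), h_base - R.T @ (R @ h_base))
```

Because the same swap can be applied in either direction (base into source, or source into base), this operation naturally supports the bidirectional steering the abstract mentions; CDAS's contribution lies in training such interventions with a distribution matching objective rather than probability maximization.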