🤖 AI Summary
This work addresses a critical limitation of existing explainable AI methods, which predominantly offer passive attribution and thus fail to support practitioners in effectively intervening on model behavior. To bridge this gap, the authors propose an interactive analysis workflow that integrates sparse autoencoder (SAE)-based attribution with activation intervention, introducing activation steering into human-in-the-loop debugging for the first time and enabling a paradigm shift from observation to active intervention. Through semi-structured interviews with eight domain experts, the study reveals that users commonly engage in intervention-based hypothesis testing, primarily employing component suppression strategies and grounding their trust in model responses rather than the plausibility of explanations. The research also uncovers key risks—including ripple effects and limited instance-level generalizability of corrections—thereby charting a new path toward trustworthy AI debugging.
📝 Abstract
Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.