🤖 AI Summary
Current interpretability research for large language models (LLMs) treats interpretability and controllability as disjoint objectives. Method: This paper proposes "intervention capability" as a unified evaluation goal and introduces an encoder-decoder framework that integrates four method families (sparse autoencoders (SAEs), Logit Lens, Tuned Lens, and probes) to enable controllable interventions on interpretable features. Contribution/Results: The paper formally defines two new metrics, intervention success rate and the coherence-intervention trade-off, and argues that effective intervention is a foundational objective of interpretability. Experiments show that lens-based methods outperform SAEs and probes on simple interventions, but existing methods exhibit inconsistent intervention efficacy across features and models. Moreover, mechanistic interventions often underperform prompt engineering, revealing critical controllability bottlenecks. This work shifts LLM interpretability research from descriptive analysis toward causal, interventionist control.
📝 Abstract
With the growing complexity and capability of large language models, a need to understand model reasoning has emerged, often motivated by an underlying goal of controlling and aligning models. While numerous interpretability and steering methods have been proposed as solutions, they are typically designed either for understanding or for control, and seldom address both. Additionally, the lack of standardized applications, motivations, and evaluation metrics makes it difficult to assess methods' practical utility and efficacy. To address these issues, we argue that intervention is a fundamental goal of interpretability and introduce success criteria to evaluate how well methods can control model behavior through interventions. To evaluate existing methods for this ability, we unify and extend four popular interpretability methods (sparse autoencoders, logit lens, tuned lens, and probing) into an abstract encoder-decoder framework, enabling interventions on interpretable features that can be mapped back to latent representations to control model outputs. We introduce two new evaluation metrics: intervention success rate and the coherence-intervention trade-off, designed to measure the accuracy of explanations and their utility in controlling model behavior. Our findings reveal that (1) while current methods allow for intervention, their effectiveness is inconsistent across features and models; (2) lens-based methods outperform SAEs and probes in achieving simple, concrete interventions; and (3) mechanistic interventions often compromise model coherence, underperforming simpler alternatives such as prompting, which highlights a critical shortcoming of current interpretability approaches in applications requiring control.
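The encoder-decoder abstraction described above can be sketched in a few lines: encode a hidden state into interpretable features, edit a feature of interest, and decode the edited features back into the model's latent space. The sketch below is illustrative only and is not the paper's implementation; the linear encoder/decoder, the ReLU nonlinearity (as in a typical SAE), and all function names are assumptions.

```python
import numpy as np

def encode(h, W_enc):
    # Map a latent representation to interpretable features
    # (e.g. an SAE encoder or a lens-style projection).
    return np.maximum(0.0, h @ W_enc)

def decode(f, W_dec):
    # Map (possibly edited) features back into the model's latent space.
    return f @ W_dec

def intervene(h, W_enc, W_dec, feature_idx, target_value):
    # Edit a single interpretable feature, then project back so the
    # modified latent state can steer the model's output.
    f = encode(h, W_enc)
    f[..., feature_idx] = target_value
    return decode(f, W_dec)

rng = np.random.default_rng(0)
d_model, d_feat = 8, 16
W_enc = rng.normal(size=(d_model, d_feat))
W_dec = rng.normal(size=(d_feat, d_model))
h = rng.normal(size=(d_model,))

# Clamp hypothetical feature 3 to a high value and recover a steered state.
h_steered = intervene(h, W_enc, W_dec, feature_idx=3, target_value=5.0)
```

In this framing, the four method families differ only in how `encode`/`decode` are instantiated (learned sparse dictionaries for SAEs, fixed or tuned unembedding projections for the lenses, supervised directions for probes), which is what makes a shared intervention benchmark possible.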