Pando: Do Interpretability Methods Work When Models Won't Explain Themselves?

📅 2026-04-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses an ambiguity in existing interpretability evaluations: they often do not control for whether black-box prompting alone can recover the target behavior, so gains attributed to white-box tools may reflect elicitation rather than internal signal. To separate these effects, the authors introduce the Pando benchmark, in which models are trained to give faithful explanations, no explanation, or confident but misleading explanations of a distractor rule. The study evaluates gradient-based attribution, relevance patching (RelP), logit lens, sparse autoencoders, and circuit tracing across 720 models fine-tuned on hidden decision-tree rules. When explanations are faithful, black-box elicitation matches or exceeds all white-box methods. When explanations are absent or misleading, gradient-based attribution improves accuracy by 3-5 percentage points and RelP gives the largest gains, while logit lens, sparse autoencoders, and circuit tracing provide no reliable benefit.

📝 Abstract

Mechanistic interpretability is often motivated for alignment auditing, where a model's verbal explanations can be absent, incomplete, or misleading. Yet many evaluations do not control whether black-box prompting alone can recover the target behavior, so apparent gains from white-box tools may reflect elicitation rather than internal signal; we call this the elicitation confounder. We introduce Pando, a model-organism benchmark that breaks this confound via an explanation axis: models are trained to produce either faithful explanations of the true rule, no explanation, or confident but unfaithful explanations of a disjoint distractor rule. Across 720 finetuned models implementing hidden decision-tree rules, agents predict held-out model decisions from 10 labeled query-response pairs, optionally augmented with one interpretability tool output. When explanations are faithful, black-box elicitation matches or exceeds all white-box methods; when explanations are absent or misleading, gradient-based attribution improves accuracy by 3-5 percentage points, and relevance patching (RelP) gives the largest gains, while logit lens, sparse autoencoders, and circuit tracing provide no reliable benefit. Variance decomposition suggests gradients track decision computation (which fields causally drive the output), whereas other readouts are dominated by task representation (biases toward field identity and value). We release all models, code, and evaluation infrastructure.
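
To make the evaluation setup concrete, here is a minimal toy sketch in Python. It is not the released Pando code: the six integer fields, the thresholds, and the names `true_rule`, `distractor_rule`, `sample_queries`, and `held_out_accuracy` are all assumptions made for illustration. Only the 10 labeled pairs, the hidden decision-tree rule, and the disjoint distractor rule come from the abstract. The sketch shows why an agent that trusts a misleading explanation may land near chance on held-out decisions.

```python
import random

NUM_FIELDS = 6    # hypothetical: each query has six integer fields in [0, 99]
NUM_LABELED = 10  # labeled query-response pairs shown to the agent (from the abstract)

def true_rule(q):
    """Hypothetical hidden decision tree over fields 0-2 (what the model actually does)."""
    if q[0] > 50:
        return q[1] > 50
    return q[2] > 50

def distractor_rule(q):
    """Hypothetical distractor tree over the disjoint fields 3-5
    (what a misleading explanation would claim)."""
    if q[3] > 50:
        return q[4] > 50
    return q[5] > 50

def sample_queries(n, seed):
    """Draw n queries with independent, uniformly random field values."""
    rng = random.Random(seed)
    return [tuple(rng.randrange(100) for _ in range(NUM_FIELDS)) for _ in range(n)]

def held_out_accuracy(predict, queries):
    """Fraction of held-out queries where the agent's prediction matches the model's decision."""
    held_out = queries[NUM_LABELED:]  # the first NUM_LABELED queries are the labeled pairs
    hits = sum(predict(q) == true_rule(q) for q in held_out)
    return hits / len(held_out)

queries = sample_queries(1010, seed=0)
acc_oracle = held_out_accuracy(true_rule, queries)
acc_misled = held_out_accuracy(distractor_rule, queries)
print(f"agent that recovered the true rule: {acc_oracle:.2f}")
print(f"agent that trusted the distractor:  {acc_misled:.2f}")
```

Because the two toy rules read disjoint, independently sampled fields, their decisions agree only about half the time, so the misled agent should score close to 0.5 while the agent that recovered the true rule scores 1.0. In the benchmark itself, the interesting question is how much of that gap a white-box tool output helps the agent close.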

Problem

Research questions and friction points this paper is trying to address.

mechanistic interpretability

elicitation confounder

alignment auditing

faithful explanations

black-box prompting

Innovation

Methods, ideas, or system contributions that make the work stand out.

elicitation confounder

mechanistic interpretability

Pando benchmark