MIB: A Mechanistic Interpretability Benchmark

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of standardized evaluation protocols in mechanistic interpretability. We introduce MIB, a benchmark with two tracks, circuit localization and causal variable localization, spanning four tasks and five models. MIB favors methods that precisely and concisely recover the causal pathways or causal variables relevant to a task. The circuit localization track compares methods that locate the model components, and the connections between them, most important for performing a task (e.g., attribution patching, mask optimization, information flow routes); the causal variable localization track compares methods that featurize hidden vectors and locate features for a task-relevant causal variable (e.g., sparse autoencoders (SAEs) and supervised distributed alignment search (DAS)). Experiments show that attribution and mask optimization methods perform best on circuit localization, while supervised DAS clearly outperforms SAEs on causal variable localization; notably, SAE features fail to surpass a raw-neuron baseline, illustrating MIB's discriminative power. MIB establishes a reproducible, comparable, and principled evaluation paradigm for mechanistic interpretability methods, advancing the field toward greater rigor and standardization.

📝 Abstract
How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. The circuit localization track compares methods that locate the model components - and connections between them - most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and locate model features for a causal variable relevant to the task. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., standard dimensions of hidden vectors. These findings illustrate that MIB enables meaningful comparisons of methods, and increases our confidence that there has been real progress in the field.
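To make the circuit localization track concrete, below is a minimal, hypothetical sketch of an activation-patching style intervention on a toy PyTorch model. The model, inputs, and "importance" estimate are illustrative assumptions, not the MIB codebase or the paper's tasks.

```python
# Minimal sketch of activation patching on a toy model (illustrative assumptions only).
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyModel(nn.Module):
    """Stand-in for a language model with one intermediate component to patch."""
    def __init__(self, d=8):
        super().__init__()
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, 1)

    def forward(self, x):
        h = torch.relu(self.layer1(x))  # activation we cache / overwrite
        return self.layer2(h)

model = TinyModel()
clean_input = torch.randn(1, 8)    # input on which the model behaves as expected
corrupt_input = torch.randn(1, 8)  # counterfactual / corrupted input

cache = {}

def save_hook(module, inputs, output):
    cache["layer1"] = output.detach()  # store the clean activation

def patch_hook(module, inputs, output):
    return cache["layer1"]             # overwrite with the cached clean activation

# 1. Clean run: cache layer1's output.
handle = model.layer1.register_forward_hook(save_hook)
model(clean_input)
handle.remove()

# 2. Corrupted run with layer1 patched back to its clean value.
handle = model.layer1.register_forward_hook(patch_hook)
patched_out = model(corrupt_input)
handle.remove()

# 3. Corrupted run without patching, for comparison.
corrupt_out = model(corrupt_input)

# The change in output when this component is restored is one estimate of its
# importance for the behavior of interest.
print(f"effect of restoring layer1: {(patched_out - corrupt_out).item():.4f}")
```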
Problem

Research questions and friction points this paper is trying to address.

How can we know whether new mechanistic interpretability methods achieve real improvements?
Lack of a standardized benchmark for comparing circuit localization methods on neural language models
Lack of a standardized way to compare causal variable localization techniques over hidden vectors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes MIB, a two-track benchmark spanning four tasks and five models for evaluating interpretability methods
Compares circuit localization methods such as attribution patching, mask optimization, and information flow routes
Evaluates causal variable localization with supervised DAS and SAEs against a raw-neuron baseline (see the sketch below)
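The causal variable localization track evaluates featurizers such as DAS, which intervene on learned directions of a hidden vector rather than on raw neurons. Below is a minimal, hypothetical sketch of such an interchange intervention on a single direction; the direction here is random rather than learned, and the patched vector would normally be fed back into the model to check whether the output changes as the counterfactual predicts.

```python
# Minimal sketch of a DAS-style interchange intervention on one direction of a
# hidden vector (illustrative assumptions; the direction would normally be learned).
import torch

torch.manual_seed(0)
d = 8

# Candidate causal feature: a unit-norm direction in hidden space.
direction = torch.randn(d)
direction = direction / direction.norm()

def interchange(hidden_base, hidden_source, u):
    """Replace hidden_base's projection onto u with hidden_source's projection."""
    coeff_base = hidden_base @ u
    coeff_source = hidden_source @ u
    return hidden_base + (coeff_source - coeff_base) * u

hidden_base = torch.randn(d)    # hidden state from the base input
hidden_source = torch.randn(d)  # hidden state from a counterfactual input

patched = interchange(hidden_base, hidden_source, direction)

# If the direction really encodes the causal variable, continuing the forward
# pass from `patched` should change the model's output accordingly.
print(patched)
```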
Aaron Mueller
Boston University
natural language processing, interpretability, robust generalization, syntax, multilingual NLP
Atticus Geiger
Pr(Ai)²R Group
Artificial Intelligence, Natural Language, Mechanistic Interpretability, Causality
Dana Arad
PhD Student, Technion
NLP, Interpretability, Vision-Language
Iván Arcuschin
University of Buenos Aires
Adam Belfki
Northeastern University
Yik Siu Chan
Brown University
machine learning, mechanistic interpretability, AI alignment
Jaden Fiotto-Kaufman
National Deep Inference Fabric
Tal Haklay
PhD student, Technion
Michael Hanna
University of Amsterdam
Jing Huang
Stanford University
Rohan Gupta
Independent
Yaniv Nikankin
Technion – IIT
Hadas Orgad
PhD student, Technion
natural language processing, deep learning, fairness, robustness, explainability
Nikhil Prakash
Northeastern University
Anja Reusch
Post-Doc, Technion – IIT
Neural Information Retrieval, Natural Language Processing, Interpretability for IR
Aruna Sankaranarayanan
Massachusetts Institute of Technology
Shun Shao
University of Cambridge
Artificial Intelligence, Machine Learning
Alessandro Stolfo
ETH Zürich
NLP, Machine Learning, Interpretability
Martin Tutek
Technion – IIT
Amir Zur
Stanford University
Natural Language Processing, Model Interpretability
David Bau
Assistant Professor at Northeastern University
Machine Learning, Computer Vision, NLP, Software Engineering, HCI
Yonatan Belinkov
Technion
Natural Language Processing, Model Interpretability, Artificial Intelligence