MIB: A Mechanistic Interpretability Benchmark

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses the lack of standardized evaluation protocols in mechanistic interpretability. We introduce MIB, a benchmark with two tracks, circuit localization and causal variable localization, spanning four tasks and five models. MIB favors methods that precisely and concisely recover the causal pathways or causal variables relevant to a task. The circuit localization track compares methods that locate the model components, and the connections between them, most important for performing a task (e.g., attribution patching, mask optimization, information flow routes); the causal variable localization track compares methods that featurize hidden vectors and locate features for a task-relevant causal variable (e.g., sparse autoencoders (SAEs) and supervised distributed alignment search (DAS)). Experiments show that attribution and mask optimization methods perform best on circuit localization, while supervised DAS clearly outperforms SAEs on causal variable localization; notably, SAE features fail to surpass a raw-neuron baseline, illustrating MIB's discriminative power. MIB establishes a reproducible, comparable, and principled evaluation paradigm for mechanistic interpretability methods, advancing the field toward greater rigor and standardization.

📝 Abstract
How can we know whether new mechanistic interpretability methods achieve real improvements? In pursuit of meaningful and lasting evaluation standards, we propose MIB, a benchmark with two tracks spanning four tasks and five models. MIB favors methods that precisely and concisely recover relevant causal pathways or specific causal variables in neural language models. The circuit localization track compares methods that locate the model components - and connections between them - most important for performing a task (e.g., attribution patching or information flow routes). The causal variable localization track compares methods that featurize a hidden vector, e.g., sparse autoencoders (SAEs) or distributed alignment search (DAS), and locate model features for a causal variable relevant to the task. Using MIB, we find that attribution and mask optimization methods perform best on circuit localization. For causal variable localization, we find that the supervised DAS method performs best, while SAE features are not better than neurons, i.e., standard dimensions of hidden vectors. These findings illustrate that MIB enables meaningful comparisons of methods, and increases our confidence that there has been real progress in the field.
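To make the circuit localization track concrete, below is a minimal, hypothetical sketch of an activation-patching style intervention on a toy PyTorch model. The model, inputs, and "importance" estimate are illustrative assumptions, not the MIB codebase or the paper's tasks.

```python
# Minimal sketch of activation patching on a toy model (illustrative assumptions only).
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyModel(nn.Module):
    """Stand-in for a language model with one intermediate component to patch."""
    def __init__(self, d=8):
        super().__init__()
        self.layer1 = nn.Linear(d, d)
        self.layer2 = nn.Linear(d, 1)

    def forward(self, x):
        h = torch.relu(self.layer1(x))  # activation we cache / overwrite
        return self.layer2(h)

model = TinyModel()
clean_input = torch.randn(1, 8)    # input on which the model behaves as expected
corrupt_input = torch.randn(1, 8)  # counterfactual / corrupted input

cache = {}

def save_hook(module, inputs, output):
    cache["layer1"] = output.detach()  # store the clean activation

def patch_hook(module, inputs, output):
    return cache["layer1"]             # overwrite with the cached clean activation

# 1. Clean run: cache layer1's output.
handle = model.layer1.register_forward_hook(save_hook)
model(clean_input)
handle.remove()

# 2. Corrupted run with layer1 patched back to its clean value.
handle = model.layer1.register_forward_hook(patch_hook)
patched_out = model(corrupt_input)
handle.remove()

# 3. Corrupted run without patching, for comparison.
corrupt_out = model(corrupt_input)

# The change in output when this component is restored is one estimate of its
# importance for the behavior of interest.
print(f"effect of restoring layer1: {(patched_out - corrupt_out).item():.4f}")
```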
Problem

Research questions and friction points this paper is trying to address.

How can we know whether new mechanistic interpretability methods achieve real improvements?
Lack of a standardized benchmark for comparing circuit localization methods on neural language models
Lack of a standardized way to compare causal variable localization techniques over hidden vectors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes MIB, a two-track benchmark spanning four tasks and five models for evaluating interpretability methods
Compares circuit localization methods such as attribution patching, mask optimization, and information flow routes
Evaluates causal variable localization with supervised DAS and SAEs against a raw-neuron baseline (see the sketch below)
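The causal variable localization track evaluates featurizers such as DAS, which intervene on learned directions of a hidden vector rather than on raw neurons. Below is a minimal, hypothetical sketch of such an interchange intervention on a single direction; the direction here is random rather than learned, and the patched vector would normally be fed back into the model to check whether the output changes as the counterfactual predicts.

```python
# Minimal sketch of a DAS-style interchange intervention on one direction of a
# hidden vector (illustrative assumptions; the direction would normally be learned).
import torch

torch.manual_seed(0)
d = 8

# Candidate causal feature: a unit-norm direction in hidden space.
direction = torch.randn(d)
direction = direction / direction.norm()

def interchange(hidden_base, hidden_source, u):
    """Replace hidden_base's projection onto u with hidden_source's projection."""
    coeff_base = hidden_base @ u
    coeff_source = hidden_source @ u
    return hidden_base + (coeff_source - coeff_base) * u

hidden_base = torch.randn(d)    # hidden state from the base input
hidden_source = torch.randn(d)  # hidden state from a counterfactual input

patched = interchange(hidden_base, hidden_source, direction)

# If the direction really encodes the causal variable, continuing the forward
# pass from `patched` should change the model's output accordingly.
print(patched)
```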
Aaron Mueller
Boston University
natural language processing, interpretability, robust generalization, syntax, multilingual NLP
Atticus Geiger
Pr(Ai)²R Group
Artificial Intelligence, Natural Language, Mechanistic Interpretability, Causality
Dana Arad
PhD Student, Technion
NLP, Interpretability, Vision-Language
Iván Arcuschin
University of Buenos Aires
Adam Belfki
Northeastern University
Yik Siu Chan
Brown University
machine learning, mechanistic interpretability, AI alignment
Jaden Fiotto-Kaufman
National Deep Inference Fabric
Tal Haklay
PhD student, Technion
Michael Hanna
University of Amsterdam
Jing Huang
Stanford University
Rohan Gupta
Independent
Yaniv Nikankin
Technion – IIT
Hadas Orgad
PhD student, Technion
natural language processing, deep learning, fairness, robustness, explainability
Nikhil Prakash
Northeastern University
Anja Reusch
Post-Doc, Technion – IIT
Neural Information Retrieval, Natural Language Processing, Interpretability for IR
Aruna Sankaranarayanan
Massachusetts Institute of Technology
Shun Shao
University of Cambridge
Artificial Intelligence, Machine Learning
Alessandro Stolfo
ETH Zürich
NLP, Machine Learning, Interpretability
Martin Tutek
Technion – IIT
Amir Zur
Stanford University
Natural Language Processing, Model Interpretability
David Bau
Assistant Professor at Northeastern University
Machine Learning, Computer Vision, NLP, Software Engineering, HCI
Yonatan Belinkov
Technion
Natural Language Processing, Model Interpretability, Artificial Intelligence