BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection

📅 2025-10-28
🤖 AI Summary
This work addresses a core challenge in mechanistic interpretability: identifying, with high faithfulness, the computational pathways a model uses to perform a specific task. To improve edge selection in circuit discovery, the authors propose three changes to the conventional pipeline: (i) bootstrapping attribution scores to identify edges whose scores are consistent; (ii) a ratio-based selection strategy that prioritizes strong positive-scoring edges, balancing performance and faithfulness; and (iii) an integer linear programming (ILP) formulation of edge selection that replaces the standard greedy selection. Across multiple MIB tasks and models, the resulting circuits are more faithful and outperform prior approaches. The implementation is publicly available.

📝 Abstract
One of the main challenges in mechanistic interpretability is circuit discovery, determining which parts of a model perform a given task. We build on the Mechanistic Interpretability Benchmark (MIB) and propose three key improvements to circuit discovery. First, we use bootstrapping to identify edges with consistent attribution scores. Second, we introduce a simple ratio-based selection strategy to prioritize strong positive-scoring edges, balancing performance and faithfulness. Third, we replace the standard greedy selection with an integer linear programming formulation. Our methods yield more faithful circuits and outperform prior approaches across multiple MIB tasks and models. Our code is available at: https://github.com/technion-cs-nlp/MIB-Shared-Task.
Problem

Research questions and friction points this paper is trying to address.

Improving the faithfulness of circuits found by circuit discovery in mechanistic interpretability
Identifying consistent edges using bootstrapped attribution scores
Balancing performance and faithfulness with a ratio-based edge selection strategy
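
The bootstrapping idea listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `bootstrap_stable_edges`, the input layout (a dict of per-example attribution scores per edge), the `min_agreement` threshold, and the "mean score is positive" consistency criterion are all assumptions made for illustration. The sketch resamples examples with replacement and keeps the edges whose mean attribution stays positive in nearly every resample.

```python
import random
from statistics import mean


def bootstrap_stable_edges(scores, n_resamples=200, min_agreement=0.95, seed=0):
    """Keep edges whose resampled mean attribution is positive in at least
    `min_agreement` of the bootstrap resamples.

    scores: dict mapping an edge name to a list of per-example attribution
            scores (hypothetical layout; all lists share the same length).
    """
    rng = random.Random(seed)
    n_examples = len(next(iter(scores.values())))
    positive_counts = {edge: 0 for edge in scores}
    for _ in range(n_resamples):
        # Draw example indices with replacement and reuse them for every edge.
        idx = [rng.randrange(n_examples) for _ in range(n_examples)]
        for edge, vals in scores.items():
            if mean(vals[i] for i in idx) > 0:
                positive_counts[edge] += 1
    return {e for e, c in positive_counts.items() if c / n_resamples >= min_agreement}


# Toy data (made up): one stable positive edge, one noisy edge, one negative edge.
scores = {
    "a->b": [0.9, 1.1, 1.0, 0.8, 1.2],
    "b->c": [2.0, -2.1, 1.9, -1.8, 0.1],
    "a->c": [-0.5, -0.4, -0.6, -0.5, -0.3],
}
stable = bootstrap_stable_edges(scores)
print(sorted(stable))
```

On this toy data only the consistently positive edge survives; the edge whose sign flips across examples is dropped even though its average score is slightly positive, which is the kind of instability bootstrapping is meant to expose.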
Innovation

Methods, ideas, or system contributions that make the work stand out.

Bootstrapping identifies edges with consistent attribution scores
Ratio-based selection prioritizes strong positive-scoring edges
Integer linear programming replaces standard greedy selection
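
As a rough illustration of what an ILP formulation of edge selection could look like, consider the sketch below. The symbols are assumptions introduced here, not taken from the paper: $E$ is the set of edges, $s_e$ the attribution score of edge $e$, $x_e$ a binary indicator for including $e$ in the circuit, $k$ an edge budget, and $\mathrm{in}(u)$ the set of edges entering node $u$. The connectivity constraint is one plausible structural requirement; the constraints actually used by the authors are not stated in this summary.

```latex
\begin{aligned}
\max_{x} \quad & \sum_{e \in E} s_e \, x_e \\
\text{s.t.} \quad & \sum_{e \in E} x_e \le k, \\
& x_{(u,v)} \le \sum_{e' \in \mathrm{in}(u)} x_{e'} \quad \text{for every edge } (u,v) \text{ whose source } u \text{ is not the input node}, \\
& x_e \in \{0, 1\} \quad \text{for all } e \in E.
\end{aligned}
```

With the budget constraint alone, the optimum is simply the (at most) $k$ highest positive-scoring edges, which is what a greedy top-$k$ selection returns. A global solver only differs from greedy selection once structural constraints such as the connectivity condition couple the decisions for different edges.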