BlackboxNLP-2025 MIB Shared Task: Improving Circuit Faithfulness via Better Edge Selection

📅 2025-10-28

🤖 AI Summary

This work addresses a core challenge in mechanistic interpretability: identifying, with high faithfulness, the computational pathways a model uses to perform a specific task. To improve edge selection in circuit discovery, the authors propose three changes to the conventional greedy pipeline: (i) bootstrapping attribution scores to keep edges whose scores are consistent; (ii) a ratio-based selection strategy that prioritizes strong positive-scoring edges, balancing performance and faithfulness; and (iii) an integer linear programming (ILP) formulation of edge selection that replaces greedy selection. Across multiple MIB tasks and models, the resulting circuits are more faithful and outperform prior approaches. The implementation is publicly available.

📝 Abstract

One of the main challenges in mechanistic interpretability is circuit discovery, determining which parts of a model perform a given task. We build on the Mechanistic Interpretability Benchmark (MIB) and propose three key improvements to circuit discovery. First, we use bootstrapping to identify edges with consistent attribution scores. Second, we introduce a simple ratio-based selection strategy to prioritize strong positive-scoring edges, balancing performance and faithfulness. Third, we replace the standard greedy selection with an integer linear programming formulation. Our methods yield more faithful circuits and outperform prior approaches across multiple MIB tasks and models. Our code is available at: https://github.com/technion-cs-nlp/MIB-Shared-Task.
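The bootstrapping idea in the abstract can be illustrated with a minimal sketch. It assumes per-example attribution scores are already available as a list of rows (one row per example, one column per edge) and treats an edge as "consistent" when the sign of its mean score agrees across most bootstrap resamples. The function names, the sign-agreement criterion, and the 0.95 threshold are illustrative assumptions, not the authors' implementation.

```python
import random


def bootstrap_positive_fraction(per_example_scores, n_resamples=200, seed=0):
    """For each edge, return the fraction of bootstrap resamples in which
    the resampled mean attribution score is strictly positive."""
    rng = random.Random(seed)
    n_examples = len(per_example_scores)
    n_edges = len(per_example_scores[0])
    positive_counts = [0] * n_edges
    for _ in range(n_resamples):
        # Resample examples with replacement.
        idx = [rng.randrange(n_examples) for _ in range(n_examples)]
        for e in range(n_edges):
            mean = sum(per_example_scores[i][e] for i in idx) / n_examples
            if mean > 0:
                positive_counts[e] += 1
    return [count / n_resamples for count in positive_counts]


def consistent_edges(per_example_scores, min_agreement=0.95, **kwargs):
    """Keep edges whose mean score has the same sign in at least
    `min_agreement` of the resamples (assumed criterion)."""
    fractions = bootstrap_positive_fraction(per_example_scores, **kwargs)
    return [
        e for e, f in enumerate(fractions)
        if f >= min_agreement or f <= 1 - min_agreement
    ]


# Toy data: edge 0 is always positive, edge 1 flips sign, edge 2 is always negative.
toy_scores = [
    [1.0, 1.0, -0.6],
    [0.8, -1.0, -0.9],
    [1.2, 1.0, -0.7],
    [0.9, -1.0, -0.8],
]
kept = consistent_edges(toy_scores)
print(kept)
```

In this toy setup the sign-flipping edge is dropped because its resampled mean changes sign from one resample to the next, while the two stable edges are kept.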

Problem

Research questions and friction points this paper is trying to address.

Improving the faithfulness of circuits found by circuit discovery in mechanistic interpretability

Identifying consistent edges using bootstrapped attribution scores

Balancing performance and faithfulness with a ratio-based edge selection strategy
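One plausible reading of a ratio-based strategy is sketched below: given an edge budget `k`, a fraction `positive_ratio` of the budget is reserved for the highest positive-scoring edges, and the remainder is filled by absolute score. This split, the function name `ratio_select`, and both parameters are assumptions made for illustration; the source does not spell out the exact rule.

```python
def ratio_select(scores, k, positive_ratio):
    """Select k edge indices: the first int(positive_ratio * k) slots go to the
    top positive-scoring edges, the rest to the largest |score| edges not yet chosen."""
    n_positive = int(positive_ratio * k)  # rounded down

    # Top positive-scoring edges, by signed score.
    by_signed = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    chosen = [e for e in by_signed if scores[e] > 0][:n_positive]
    chosen_set = set(chosen)

    # Fill the remaining budget by absolute score.
    by_abs = sorted(range(len(scores)), key=lambda e: abs(scores[e]), reverse=True)
    for e in by_abs:
        if len(chosen) >= k:
            break
        if e not in chosen_set:
            chosen.append(e)
            chosen_set.add(e)
    return sorted(chosen)


toy_edge_scores = [0.9, -0.8, 0.5, -0.7, 0.1, 0.3]
only_abs = ratio_select(toy_edge_scores, k=4, positive_ratio=0.0)
mostly_pos = ratio_select(toy_edge_scores, k=4, positive_ratio=0.75)
only_pos = ratio_select(toy_edge_scores, k=4, positive_ratio=1.0)
print(only_abs, mostly_pos, only_pos)
```

Under this reading, sweeping `positive_ratio` from 0 to 1 moves the selection from purely magnitude-based to purely positive-scoring edges.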

Innovation

Methods, ideas, or system contributions that make the work stand out.

Bootstrapping identifies edges with consistent attribution scores

Ratio-based selection prioritizes strong positive-scoring edges

Integer linear programming replaces standard greedy selection
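To make the contrast between greedy and ILP-based selection concrete, here is a toy sketch of edge selection written as a 0-1 integer program: maximize the total score of selected edges subject to an edge budget and to connectivity constraints (an edge may be selected only if its source node receives a selected edge and its target node feeds one, unless that node is the input or output). The constraint set, node names, and scores are illustrative assumptions, and the program is solved by exhaustive enumeration rather than a real ILP solver, which is feasible only for a handful of edges.

```python
from itertools import product


def build_constraints(edges, k, source="in", sink="out"):
    """Return rows (coeffs, rhs), each meaning sum(coeffs[i] * x[i]) <= rhs,
    over binary edge variables x."""
    n = len(edges)
    rows = [([1] * n, k)]  # budget: at most k edges
    for e, (u, v) in enumerate(edges):
        if u != source:
            # x_e <= sum of selected edges entering u
            row = [0] * n
            row[e] = 1
            for f, (_, head) in enumerate(edges):
                if head == u:
                    row[f] -= 1
            rows.append((row, 0))
        if v != sink:
            # x_e <= sum of selected edges leaving v
            row = [0] * n
            row[e] = 1
            for f, (tail, _) in enumerate(edges):
                if tail == v:
                    row[f] -= 1
            rows.append((row, 0))
    return rows


def solve_edge_ilp(edges, scores, k):
    """Exhaustively search binary assignments; exact, but exponential in len(edges)."""
    rows = build_constraints(edges, k)
    best_x, best_value = (0,) * len(edges), 0.0
    for x in product((0, 1), repeat=len(edges)):
        feasible = all(
            sum(c * xi for c, xi in zip(coeffs, x)) <= rhs for coeffs, rhs in rows
        )
        value = sum(s * xi for s, xi in zip(scores, x))
        if feasible and value > best_value:
            best_x, best_value = x, value
    return [e for e, xi in enumerate(best_x) if xi]


toy_edges = [("in", "a"), ("a", "out"), ("in", "b"), ("b", "out")]
toy_scores = [0.1, 0.9, 0.5, 0.4]

# Greedy top-k by score ignores connectivity.
greedy = sorted(sorted(range(4), key=lambda e: toy_scores[e], reverse=True)[:2])
ilp = solve_edge_ilp(toy_edges, toy_scores, k=2)
print(greedy, ilp)
```

In this example the greedy top-2 picks `a -> out` and `in -> b`, which do not form a path from input to output, while the constrained program picks the connected path through `a`. For realistic graphs the same objective and constraint rows would be handed to an ILP solver instead of being enumerated.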

🔎 Similar Papers

Finding Transformer Circuits with Edge Pruning