🤖 AI Summary
This work addresses a core challenge in mechanistic interpretability: high-fidelity identification of the computational pathways a model uses to perform specific tasks. To improve edge selection in circuit discovery, we propose three departures from conventional greedy strategies: (i) attribution-score bootstrapping to enhance selection stability; (ii) a ratio-based thresholding mechanism for principled edge filtering; and (iii) the first formulation of edge selection as an integer linear programming (ILP) problem, enabling global optimization that jointly balances faithfulness and sparsity. Experiments across multiple MIB benchmark tasks and models of varying scales demonstrate that our approach significantly outperforms existing circuit discovery methods, achieving higher faithfulness without sacrificing circuit compactness. The implementation is publicly available.
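To make the bootstrapping idea concrete, here is a minimal sketch (not the paper's implementation): given per-example attribution scores for each edge, we resample examples with replacement and keep only edges whose bootstrapped mean score has a consistent sign. The function name, the consistency threshold, and the dict-based input format are all illustrative assumptions.

```python
import random

def bootstrap_consistent_edges(edge_scores, n_boot=1000, consistency=0.95, seed=0):
    """Keep edges whose attribution score is stable under bootstrapping.

    edge_scores: dict mapping edge -> list of per-example attribution scores.
    An edge is kept if its bootstrapped mean keeps the same sign as its
    overall mean in at least `consistency` of the resamples.
    (Illustrative sketch; the paper's criterion may differ.)
    """
    rng = random.Random(seed)
    kept = {}
    for edge, scores in edge_scores.items():
        n = len(scores)
        base_positive = sum(scores) >= 0
        same_sign = 0
        for _ in range(n_boot):
            # Resample the per-example scores with replacement.
            sample_mean = sum(scores[rng.randrange(n)] for _ in range(n)) / n
            if (sample_mean >= 0) == base_positive:
                same_sign += 1
        if same_sign / n_boot >= consistency:
            kept[edge] = sum(scores) / n
    return kept
```

An edge with uniformly positive scores survives, while an edge whose scores flip sign across examples is filtered out as unstable.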
📝 Abstract
One of the main challenges in mechanistic interpretability is circuit discovery: determining which parts of a model perform a given task. We build on the Mechanistic Interpretability Benchmark (MIB) and propose three key improvements to circuit discovery. First, we use bootstrapping to identify edges with consistent attribution scores. Second, we introduce a simple ratio-based selection strategy to prioritize strong positive-scoring edges, balancing performance and faithfulness. Third, we replace the standard greedy selection with an integer linear programming formulation. Our methods yield more faithful circuits and outperform prior approaches across multiple MIB tasks and models. Our code is available at: https://github.com/technion-cs-nlp/MIB-Shared-Task.
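To illustrate the ILP-style global selection, here is a toy sketch under a simplifying assumption: the objective reduces to choosing at most `budget` edges that maximize total attribution score. The paper's actual ILP likely includes further constraints (e.g. faithfulness or connectivity), and a real implementation would call an off-the-shelf ILP solver; this version solves the tiny 0/1 program by exhaustive search so it is self-contained.

```python
from itertools import combinations

def select_edges_globally(scores, budget):
    """Toy stand-in for ILP edge selection: pick at most `budget` edges
    maximizing the total attribution score, searched exhaustively.
    Unlike greedy selection, this considers all subsets jointly.
    (Illustrative only; the paper's formulation may add constraints.)
    """
    edges = list(scores)
    best, best_val = set(), 0.0
    for k in range(budget + 1):
        for subset in combinations(edges, k):
            val = sum(scores[e] for e in subset)
            if val > best_val:
                best, best_val = set(subset), val
    return best, best_val
```

For example, with scores `{"e1": 0.9, "e2": -0.3, "e3": 0.4, "e4": 0.2}` and a budget of 2, the selection is `{"e1", "e3"}`; negatively scored edges are never forced in, since fewer than `budget` edges may be chosen.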