🤖 AI Summary
This work addresses the limited precision of current methods for localizing task-relevant subnetworks ("circuits") within large language models (LLMs). We propose and systematically evaluate a multi-attribution integration framework that moves beyond single-attribution methods. Specifically, we design and validate two integration paradigms, parallel (e.g., averaging or extremum aggregation of edge scores) and sequential, and find parallel fusion consistently superior. In the sequential paradigm, edge attribution scores from Edge Attribution Patching with Integrated Gradients (EAP-IG) serve as a warm start for a more precise but more expensive method, edge pruning; in the parallel paradigm, attribution scores from multiple methods are fused directly. Evaluated on the MIB benchmark across diverse model-task combinations, our approach significantly outperforms the official baselines, yielding substantial gains in circuit identification accuracy. The core contribution is the first systematic empirical demonstration that multi-attribution integration improves circuit localization, and the establishment of parallel fusion as the current best practice.
📝 Abstract
The Circuit Localization track of the Mechanistic Interpretability Benchmark (MIB) evaluates methods for localizing circuits within large language models (LLMs), i.e., subnetworks responsible for specific task behaviors. In this work, we investigate whether ensembling two or more circuit localization methods can improve performance. We explore two variants: parallel and sequential ensembling. In parallel ensembling, we combine the attribution scores assigned to each edge by different methods, e.g., by averaging or by taking the minimum or maximum value. In the sequential ensemble, we use edge attribution scores obtained via EAP-IG as a warm start for a more expensive but more precise circuit identification method, namely edge pruning. We observe that both approaches yield notable gains on the benchmark metrics, leading to more precise circuit identification. Finally, we find that taking a parallel ensemble over various methods, including the sequential ensemble, achieves the best results. We evaluate our approach in the BlackboxNLP 2025 MIB Shared Task, comparing ensemble scores to the official baselines across multiple model-task combinations.
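The parallel ensemble described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names (`parallel_fuse`, `top_k_circuit`) and the dict-of-edge-scores representation are assumptions made for the example; the mean/min/max aggregation modes correspond to the fusion variants named in the abstract.

```python
import numpy as np

def parallel_fuse(score_maps, mode="mean"):
    """Fuse per-edge attribution scores from several localization methods.

    score_maps: list of dicts mapping an edge identifier to that method's
        attribution score (hypothetical representation for this sketch).
    mode: 'mean', 'min', or 'max', the parallel aggregation variants.
    """
    # Only fuse edges scored by every method.
    edges = set.intersection(*(set(m) for m in score_maps))
    agg = {"mean": np.mean, "min": np.min, "max": np.max}[mode]
    return {e: float(agg([m[e] for m in score_maps])) for e in edges}

def top_k_circuit(fused, k):
    """Select a circuit: the k edges with the largest absolute fused score."""
    return sorted(fused, key=lambda e: abs(fused[e]), reverse=True)[:k]
```

In this framing, a sequential ensemble's output (e.g., scores refined by edge pruning initialized from EAP-IG) is just one more score map passed into `parallel_fuse`, which is how the best-performing combination in the abstract can be expressed.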