Attributions All the Way Down? The Metagame of Interpretability

📅 2026-05-07

🤖 AI Summary
Existing model explanation methods struggle to quantify second-order interactions and dependencies among features. This work proposes the “metagame”, a framework that treats the attribution method itself as a cooperative game and uses Shapley values to compute the directed influence of feature j on the attribution of feature i, termed meta-attribution. The authors prove that attributions decompose hierarchically into meta-attributions, and that meta-attributions are directional extensions of existing interaction indices. Experiments apply the approach to instruction-tuned language models, vision-language encoders, and multimodal diffusion transformers, covering token-level interactions, cross-modal similarity, and text-to-image concept mappings.
📝 Abstract
We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $\varphi(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\varphi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.
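The idea in the abstract can be illustrated with a small sketch. The abstract does not spell out how the metagame is constructed, so the construction below is an assumption made for illustration only: for a target feature i, the metagame assigns to each coalition S the Shapley attribution of i in the game restricted to S, and the meta-attribution of j on i is taken to be j's Shapley value in that metagame. The names `shapley`, `meta_attribution`, and `toy_game` are hypothetical, and exact enumeration is only feasible for a handful of features.

```python
from itertools import combinations
from math import factorial


def shapley(value, players):
    """Exact Shapley values of `value` (a function on frozensets) by enumeration."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain i.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {i}) - value(s))
        phi[i] = total
    return phi


def meta_attribution(value, players):
    """Return meta[j][i], the influence of j on the attribution of i.

    ASSUMPTION (not taken from the paper): the metagame for target i maps a
    coalition S to the Shapley attribution of i in the game restricted to S.
    """
    meta = {j: {} for j in players}
    for i in players:
        def metagame(coalition, i=i):
            # Restricted game: players outside `coalition` contribute nothing.
            def restricted(t):
                return value(frozenset(t) & coalition)
            return shapley(restricted, players)[i]

        phi_meta = shapley(metagame, players)
        for j in players:
            meta[j][i] = phi_meta[j]
    return meta


def toy_game(s):
    """Hypothetical game: feature 0 has a main effect and interacts with feature 1."""
    return 1.0 * (0 in s) + 2.0 * (0 in s and 1 in s)


players = [0, 1, 2]
phi = shapley(toy_game, players)
meta = meta_attribution(toy_game, players)

for i in players:
    column_sum = sum(meta[j][i] for j in players)
    print(f"phi[{i}] = {phi[i]:.3f}, sum_j meta[j][{i}] = {column_sum:.3f}")
```

Under this assumed construction, the efficiency property of the Shapley value makes the meta-attributions into each feature i sum to that feature's first-order attribution, which mirrors the hierarchical decomposition the abstract describes. Whether the paper's definition matches this construction should be checked against the full text.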
Problem

Research questions and friction points this paper is trying to address.

interpretability
attribution
interaction effects
Shapley value
meta-attribution
Innovation

Methods, ideas, or system contributions that make the work stand out.

metagame
meta-attribution
Shapley value
second-order interaction
model interpretability