Playing the network backward: A Game Theoretic Attribution Framework

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of a shared framework for comparing the backward calculations that underlie existing backward attribution methods, including gradients, LRP, and transformer-specific rules. It recasts backward attribution as a two-player game on an extended network graph, building on the ReLU Net Game of Gaubert and Vlassopoulos, and shows that gradients and the full alpha-beta-LRP family arise as integrals over game trajectories under specific equilibria. Attribution properties such as localization focus and robustness to input noise are expressed through game-theoretic concepts such as policy regularization and risk aversion, which yields novel adaptations of the well-known backward rules. On ViT-B/16, one such adaptation of alpha-beta-LRP outperforms prior transformer-specific backward methods across all considered localization metrics.

📝 Abstract

Attribution methods explain which input features drive a model's prediction, making them central to model debugging and mechanistic interpretability. Yet backward attribution methods, including gradients, LRP, and transformer-specific rules, lack a shared framework in which to compare the underlying backward calculations. We introduce such a framework by recasting backward attribution as a two-player game on an extended network graph, building on Gaubert and Vlassopoulos' ReLU Net Game. Gradients and the full alpha-beta-LRP family arise as integrals over game trajectories under specific equilibria, so attribution maps become projections of trajectory distributions rather than the primary object. Desired explanation properties, such as localisation focus, robustness to input noise, or stable attention routing, can be specified as game-theoretic concepts, including policy regularization, risk aversion, and extended action sets, and translate directly into novel adaptations of the well-known backward rules. On ViT-B/16, one such selected adaptation of alpha-beta-LRP outperforms prior transformer-specific backward methods across all considered localisation metrics.
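For readers unfamiliar with the backward rules named in the abstract, the following is a minimal sketch of the conventional alpha-beta-LRP rule for a single bias-free linear layer, as it is usually stated in the LRP literature. It is not the paper's game-theoretic formulation or its adapted variant; the function name `alpha_beta_lrp_linear`, the list-based layout, and the default `alpha=2.0, beta=1.0` are illustrative assumptions.

```python
def alpha_beta_lrp_linear(a, W, R_out, alpha=2.0, beta=1.0):
    """Redistribute output relevance to the inputs of a bias-free linear layer.

    a:     input activations, length n_in
    W:     weights, W[i][j] connects input i to output j
    R_out: relevance assigned to each output neuron, length n_out
    """
    # Conventional constraint of the alpha-beta family (assumed here).
    if abs(alpha - beta - 1.0) > 1e-9:
        raise ValueError("alpha-beta-LRP conventionally requires alpha - beta = 1")

    n_in, n_out = len(a), len(R_out)
    R_in = [0.0] * n_in
    for j in range(n_out):
        # Contribution of each input i to the pre-activation of output j.
        z = [a[i] * W[i][j] for i in range(n_in)]
        z_pos = sum(v for v in z if v > 0)  # total positive contribution
        z_neg = sum(v for v in z if v < 0)  # total negative contribution
        for i in range(n_in):
            if z[i] > 0:
                # Positive contributions share alpha * R_out[j].
                R_in[i] += alpha * (z[i] / z_pos) * R_out[j]
            elif z[i] < 0:
                # Negative contributions share -beta * R_out[j].
                R_in[i] -= beta * (z[i] / z_neg) * R_out[j]
    return R_in


# Hypothetical toy layer: 3 inputs, 2 outputs.
a = [1.0, 2.0, 0.5]
W = [[1.0, -1.0], [-0.5, 0.5], [2.0, -2.0]]
R_out = [1.0, 0.5]
R_in = alpha_beta_lrp_linear(a, W, R_out)
print(R_in, sum(R_in), sum(R_out))
```

In this toy layer every output neuron receives both positive and negative contributions, so the total relevance is conserved across the layer. If an output had contributions of only one sign, this simple rule would not conserve relevance without further handling.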

Problem

Research questions and friction points this paper is trying to address.

attribution

backward methods

interpretability

game theory

framework

Innovation

Methods, ideas, or system contributions that make the work stand out.

game-theoretic attribution

backward attribution

alpha-beta-LRP