🤖 AI Summary
Model explanations often suffer from inconsistency and unreliability, undermining trust in interpretability methods.
Method: We propose the Fixed-Point Interpretability (FPI) framework, which formally defines a "fixed-point explanation" as one satisfying minimality, stability, and faithfulness. Leveraging fixed-point theory and convergence analysis, FPI recursively evaluates the interaction between a model and an explainer until convergence, exposing latent model behaviors and explainer weaknesses; a code sketch of this loop follows the summary. The framework is instantiated for diverse explainers, including feature-attribution methods and sparse autoencoders, and their convergence conditions are verified systematically.
Contribution/Results: We introduce the first theoretical paradigm that models explanation stability as a fixed-point problem, providing a new evaluation lens for interpretability. Experiments report quantitative stability metrics and canonical failure cases, characterizing when explanations from different explainer classes can be trusted.
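To make the recursive procedure concrete, below is a minimal, hypothetical sketch of the model/explainer loop, assuming an explainer that maps (model, input) to a set of feature indices. The function names, the feature-masking instantiation, and the iteration budget are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch (assumed interface): recursively re-explain the model until the
# explanation stops changing, i.e. until a fixed point is reached.
from typing import Callable, Set

import numpy as np


def fixed_point_explanation(
    model: Callable[[np.ndarray], float],
    explainer: Callable[[Callable[[np.ndarray], float], np.ndarray], Set[int]],
    x: np.ndarray,
    max_iters: int = 20,
) -> Set[int]:
    """Iterate the explainer on the model restricted to its own explanation."""
    features = explainer(model, x)
    for _ in range(max_iters):
        idx = np.array(sorted(features), dtype=int)

        def restricted_model(z: np.ndarray) -> float:
            # Zero out everything outside the currently explained features
            # (one possible way of "re-applying" the explainer to itself).
            masked = np.zeros_like(z)
            masked[idx] = z[idx]
            return model(masked)

        new_features = explainer(restricted_model, x)
        if new_features == features:   # stable: the explanation explains itself
            return features
        features = new_features
    return features                    # no fixed point within the budget
```

In this sketch an attribution-based explainer would return its top-k feature indices, while an SAE-based explainer could return the indices of active latent units; either plugs into the same loop.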
📝 Abstract
This paper introduces a formal notion of fixed point explanations, inspired by the "why regress" principle, to assess, through recursive applications, the stability of the interplay between a model and its explainer. Fixed point explanations satisfy properties like minimality, stability, and faithfulness, revealing hidden model behaviours and explanatory weaknesses. We define convergence conditions for several classes of explainers, from feature-based to mechanistic tools like Sparse AutoEncoders, and we report quantitative and qualitative results.
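As one hedged illustration of what a convergence condition for a feature-based explainer could look like, the sketch below declares a fixed point when consecutive explanations (nearly) coincide under a Jaccard overlap; the metric and tolerance are assumptions of this sketch, not the paper's formal criteria.

```python
# Assumed stability check for feature-set explanations; the Jaccard overlap and
# the tolerance threshold are illustrative choices, not the paper's definitions.
from typing import Sequence, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap between two explanations given as feature-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def has_converged(history: Sequence[Set[int]], tol: float = 1.0) -> bool:
    """A fixed point is reached when consecutive explanations (nearly) coincide."""
    return len(history) >= 2 and jaccard(history[-1], history[-2]) >= tol
```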