Fixed Point Explainability

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
Problem: Model explanations often suffer from inconsistency and unreliability, undermining trust in interpretability methods. Method: We propose the Fixed-Point Interpretability (FPI) framework, which formally defines a “fixed-point explanation” as one satisfying minimality, stability, and faithfulness. Leveraging fixed-point theory and convergence analysis, FPI recursively evaluates the interaction between a model and an explainer until convergence, exposing latent model behaviors and explainer weaknesses. The framework is instantiated for diverse explainers—including feature attribution methods and sparse autoencoders—and their convergence conditions are verified systematically. Contribution/Results: We introduce the first theoretical paradigm that models explanation stability as a fixed-point problem, establishing a new evaluation benchmark for interpretability. Experiments provide quantitative stability metrics and canonical failure cases, demonstrating significant improvements in explanation reliability and trustworthiness across multiple architectures and datasets.
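The recursive procedure described above can be sketched as a simple iteration: apply the explainer, restrict the input to the explained features, re-explain, and stop when the explanation no longer changes. The sketch below is a minimal illustration under assumed interfaces — `top_k_attribution` is a hypothetical stand-in for any feature-attribution explainer, and masking-to-zero is one possible ablation choice; the paper's actual instantiations may differ.

```python
import numpy as np

def top_k_attribution(model, x, k):
    """Toy explainer: score each feature by the output change when it is
    zeroed out, then keep the top-k. (Hypothetical stand-in for any
    feature-attribution method.)"""
    base = model(x)
    scores = np.array([
        abs(model(np.where(np.arange(x.size) == i, 0.0, x)) - base)
        for i in range(x.size)
    ])
    mask = np.zeros(x.size, dtype=bool)
    mask[np.argsort(scores)[-k:]] = True
    return mask

def fixed_point_explanation(model, x, explainer, k, max_iter=20):
    """Recursively re-explain the masked input until the explanation
    stops changing -- a candidate fixed-point explanation."""
    mask = explainer(model, x, k)
    for _ in range(max_iter):
        new_mask = explainer(model, np.where(mask, x, 0.0), k)
        if np.array_equal(new_mask, mask):
            return mask, True   # converged: explanation is a fixed point
        mask = new_mask
    return mask, False          # no fixed point within the budget
```

For a linear model the iteration converges immediately, since masking out low-scoring features does not change the scores of the retained ones; nonlinear models or unstable explainers can oscillate, which is exactly the failure mode the fixed-point criterion is meant to expose.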

📝 Abstract
This paper introduces a formal notion of fixed point explanations, inspired by the "why regress" principle, to assess, through recursive applications, the stability of the interplay between a model and its explainer. Fixed point explanations satisfy properties like minimality, stability, and faithfulness, revealing hidden model behaviours and explanatory weaknesses. We define convergence conditions for several classes of explainers, from feature-based to mechanistic tools like Sparse AutoEncoders, and we report quantitative and qualitative results.
Problem

Research questions and friction points this paper is trying to address.

Assessing model-explainer interplay stability via fixed point explanations
Defining convergence conditions for diverse explainer classes
Evaluating minimality, stability, and faithfulness in revealing model behaviors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Formal fixed point explanations for model stability
Convergence conditions for diverse explainer classes
Quantitative and qualitative assessment of explanatory properties