🤖 AI Summary
Model explanations often suffer from inconsistency and unreliability, undermining trust in interpretability methods.
Method: We propose the Fixed-Point Interpretability (FPI) framework, which formally defines a "fixed-point explanation" as one satisfying minimality, stability, and faithfulness. Leveraging fixed-point theory and convergence analysis, FPI recursively evaluates the interaction between a model and an explainer until convergence, exposing latent model behaviors and explainer weaknesses; a code sketch of this loop follows the summary. The framework is instantiated for diverse explainers, including feature-attribution methods and sparse autoencoders, and their convergence conditions are verified systematically.
Contribution/Results: We introduce the first theoretical paradigm that models explanation stability as a fixed-point problem, providing a new evaluation lens for interpretability. Experiments report quantitative stability metrics and canonical failure cases, characterizing when explanations from different explainer classes can be trusted.
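To make the recursive procedure concrete, below is a minimal, hypothetical sketch of the model/explainer loop, assuming an explainer that maps (model, input) to a set of feature indices. The function names, the feature-masking instantiation, and the iteration budget are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch (assumed interface): recursively re-explain the model until the
# explanation stops changing, i.e. until a fixed point is reached.
from typing import Callable, Set

import numpy as np


def fixed_point_explanation(
    model: Callable[[np.ndarray], float],
    explainer: Callable[[Callable[[np.ndarray], float], np.ndarray], Set[int]],
    x: np.ndarray,
    max_iters: int = 20,
) -> Set[int]:
    """Iterate the explainer on the model restricted to its own explanation."""
    features = explainer(model, x)
    for _ in range(max_iters):
        idx = np.array(sorted(features), dtype=int)

        def restricted_model(z: np.ndarray) -> float:
            # Zero out everything outside the currently explained features
            # (one possible way of "re-applying" the explainer to itself).
            masked = np.zeros_like(z)
            masked[idx] = z[idx]
            return model(masked)

        new_features = explainer(restricted_model, x)
        if new_features == features:   # stable: the explanation explains itself
            return features
        features = new_features
    return features                    # no fixed point within the budget
```

In this sketch an attribution-based explainer would return its top-k feature indices, while an SAE-based explainer could return the indices of active latent units; either plugs into the same loop.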
📝 Abstract
This paper introduces a formal notion of fixed point explanations, inspired by the "why regress" principle, to assess, through recursive applications, the stability of the interplay between a model and its explainer. Fixed point explanations satisfy properties like minimality, stability, and faithfulness, revealing hidden model behaviours and explanatory weaknesses. We define convergence conditions for several classes of explainers, from feature-based to mechanistic tools like Sparse AutoEncoders, and we report quantitative and qualitative results.
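As one hedged illustration of what a convergence condition for a feature-based explainer could look like, the sketch below declares a fixed point when consecutive explanations (nearly) coincide under a Jaccard overlap; the metric and tolerance are assumptions of this sketch, not the paper's formal criteria.

```python
# Assumed stability check for feature-set explanations; the Jaccard overlap and
# the tolerance threshold are illustrative choices, not the paper's definitions.
from typing import Sequence, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Overlap between two explanations given as feature-index sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def has_converged(history: Sequence[Set[int]], tol: float = 1.0) -> bool:
    """A fixed point is reached when consecutive explanations (nearly) coincide."""
    return len(history) >= 2 and jaccard(history[-1], history[-2]) >= tol
```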