🤖 AI Summary
This work addresses the opacity of deep learning model decisions by proposing I2X, a framework that converts unstructured post-hoc interpretability signals—such as GradCAM saliency maps—into structured explanations during model training. Unlike existing interpretability methods that rely on external surrogate models and yield unstructured outputs, I2X extracts prototype representations from internal training checkpoints to faithfully reveal both intra-class and inter-class decision logic without requiring additional proxy models. The framework further enables targeted model refinement based on uncertain prototypes. Experiments on MNIST and CIFAR-10 demonstrate that I2X not only clearly elucidates the prototype-based reasoning process of image classification models but also effectively improves their predictive accuracy.
📝 Abstract
Deep learning models achieve remarkable predictive performance, yet their black-box nature limits transparency and trustworthiness. Although numerous explainable artificial intelligence (XAI) methods have been proposed, they primarily provide saliency maps or concepts (i.e., unstructured interpretability). Existing approaches often rely on auxiliary models (e.g., GPT, CLIP) to describe model behavior, thereby compromising faithfulness to the original models. We propose Interpretability to Explainability (I2X), a framework that builds structured explanations directly from unstructured interpretability by quantifying training progress at selected checkpoints using prototypes extracted from post-hoc XAI methods (e.g., GradCAM). I2X answers the question "why does it look there?" by providing a structured view of both intra- and inter-class decision making during training. Experiments on MNIST and CIFAR-10 demonstrate the effectiveness of I2X in revealing the prototype-based inference process of various image classification models. Moreover, we demonstrate that I2X can improve predictions across different model architectures and datasets: we identify uncertain prototypes flagged by I2X and apply targeted perturbation to the corresponding samples, enabling fine-tuning that ultimately improves accuracy. Thus, I2X not only faithfully explains model behavior but also provides a practical approach to guide optimization toward desired targets.
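To make the prototype idea concrete, below is a minimal NumPy sketch of one plausible reading of the pipeline: saliency-weighted pooling of checkpoint feature maps into per-sample embeddings, class-mean prototypes, and a top-2 margin score for "uncertain" prototypes. All function names and design choices here are illustrative assumptions, not the paper's actual implementation, and the GradCAM maps are assumed to be precomputed.

```python
import numpy as np

def saliency_weighted_prototype(features, saliency):
    """Pool feature maps into per-sample embeddings, weighted by saliency.

    features: (N, C, H, W) activations from a training checkpoint.
    saliency: (N, H, W) non-negative GradCAM-style maps (assumed precomputed).
    Returns:  (N, C) saliency-weighted embeddings.
    """
    w = saliency / (saliency.sum(axis=(1, 2), keepdims=True) + 1e-8)
    return (features * w[:, None, :, :]).sum(axis=(2, 3))

def class_prototypes(embeddings, labels, num_classes):
    """Mean embedding per class -> (num_classes, C) prototype matrix."""
    return np.stack([embeddings[labels == k].mean(axis=0)
                     for k in range(num_classes)])

def prototype_uncertainty(embeddings, prototypes):
    """Score samples by how ambiguous their nearest prototype is.

    Uses 1 minus the cosine-similarity margin between the two closest
    prototypes; values near 1 mark candidates for targeted refinement.
    """
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = e @ p.T                       # (N, num_classes)
    top2 = np.sort(sims, axis=1)[:, -2:]  # two highest similarities
    return 1.0 - (top2[:, 1] - top2[:, 0])
```

High-uncertainty samples from `prototype_uncertainty` would then be perturbed and used for fine-tuning, mirroring the refinement loop the abstract describes.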