🤖 AI Summary
Existing automatic concept discovery methods often decouple learned concepts from the model's true decision-making mechanism, undermining explanatory fidelity. To address this, we propose FACE (Faithful Automatic Concept Extraction), a faithful concept learning framework grounded in non-negative matrix factorization (NMF). Our approach jointly incorporates KL-divergence regularization and classifier supervision to explicitly align low-level features with high-level semantic concepts, thereby ensuring conceptual consistency with model predictions. This design provides theoretical guarantees on local linear faithfulness, i.e., the extent to which concept-based approximations preserve the original model's behavior in local neighborhoods. Extensive experiments on ImageNet, COCO, and CelebA demonstrate that our method significantly outperforms state-of-the-art approaches on two core metrics: concept faithfulness (agreement with the model's own predictions) and concept sparsity (compactness of the concept basis). By bridging the gap between representation learning and model interpretation, our framework enhances both the faithfulness and reliability of deep neural network explanations.
📝 Abstract
Interpreting deep neural networks through concept-based explanations offers a bridge between low-level features and high-level human-understandable semantics. However, existing automatic concept discovery methods often fail to align these extracted concepts with the model's true decision-making process, thereby compromising explanation faithfulness. In this work, we propose FACE (Faithful Automatic Concept Extraction), a novel framework that augments Non-negative Matrix Factorization (NMF) with a Kullback-Leibler (KL) divergence regularization term to ensure alignment between the model's original and concept-based predictions. Unlike prior methods that operate solely on encoder activations, FACE incorporates classifier supervision during concept learning, enforcing predictive consistency and enabling faithful explanations. We provide theoretical guarantees showing that minimizing the KL divergence bounds the deviation in predictive distributions, thereby promoting faithful local linearity in the learned concept space. Systematic evaluations on ImageNet, COCO, and CelebA datasets demonstrate that FACE outperforms existing methods across faithfulness and sparsity metrics.
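The core idea above can be sketched as an NMF objective augmented with a KL term that ties the factorized activations back to the classifier. The sketch below is illustrative, not the paper's implementation: the function names, the linear classifier head (`clf_W`, `clf_b`), the KL direction, the weight `lam`, and the projected-gradient solver are all assumptions made for a minimal, self-contained example. Non-negative activations `A` (e.g. post-ReLU features) are factorized as `A ≈ U @ V`, while `lam * KL(p || q)` penalizes deviation between the model's original predictions `p` and the predictions `q` obtained from the concept-based reconstruction.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def face_nmf(A, clf_W, clf_b, k, lam=1.0, lr=1e-2, steps=500, seed=0):
    """Illustrative KL-regularized NMF (hypothetical names/solver).

    Minimizes ||A - U V||_F^2 + lam * KL(p || q), where
    p = softmax(A @ clf_W + clf_b) are the model's original predictions
    and q = softmax(U V @ clf_W + clf_b) the concept-based ones,
    using projected gradient descent to keep U, V non-negative.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    U = rng.random((n, k))
    V = rng.random((k, d))
    p = softmax(A @ clf_W + clf_b)  # fixed target distribution
    for _ in range(steps):
        R = U @ V
        q = softmax(R @ clf_W + clf_b)
        # gradient of the reconstruction term w.r.t. R
        G = 2.0 * (R - A)
        # d KL(p||q) / d logits = q - p; chain through the classifier weights
        G += lam * (q - p) @ clf_W.T
        gU = G @ V.T
        gV = U.T @ G
        U = np.maximum(U - lr * gU, 0.0)  # projected step: clip to >= 0
        V = np.maximum(V - lr * gV, 0.0)
    return U, V
```

Rows of `V` play the role of concept directions and rows of `U` give per-sample concept coefficients; the KL term is what distinguishes this from plain NMF on encoder activations, pulling the factorization toward reconstructions the classifier treats the same way it treats the originals.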