🤖 AI Summary
This work addresses a prevalent issue in existing concept-based explanations for deep networks: they rely on overly strong assumptions, such as class specificity, small spatial extent, and alignment with human priors. It proposes an intrinsic concept-explanation framework built around a learnable concept tracing mechanism that faithfully extracts and quantifies concepts shared across classes, supporting concept–logit contribution analysis and input visualization from any network layer. A key innovation is the C²-Score, an unsupervised, scalable metric grounded in foundation models that enables quantitative evaluation of concept consistency without ground-truth supervision. The method retains competitive ImageNet performance while yielding quantitatively more consistent concepts, and user studies confirm that the extracted concepts are more interpretable and comprehensible than those of mainstream post-hoc explanation methods across the evaluated dimensions.
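To make the contribution analysis concrete, here is a minimal, hypothetical sketch of how concept–logit contributions can be traced when the classification head is linear over concept activations. This is an illustrative simplification under assumed design choices, not the paper's exact mechanism; the names (`ConceptHead`, `to_concepts`) are invented for the example.

```python
import torch
import torch.nn as nn

class ConceptHead(nn.Module):
    """Hypothetical head for illustration: class logits are a linear map over
    concept activations, so each concept's contribution to a logit is simply
    activation * weight. An assumed simplification, not the paper's exact
    tracing mechanism."""

    def __init__(self, feat_dim: int, n_concepts: int, n_classes: int):
        super().__init__()
        # Learnable mapping from backbone features to concept activations (assumed form).
        self.to_concepts = nn.Linear(feat_dim, n_concepts)
        # Concepts are shared across classes; the classifier only re-weights them.
        self.classifier = nn.Linear(n_concepts, n_classes, bias=False)

    def forward(self, features: torch.Tensor):
        concepts = self.to_concepts(features).relu()      # (B, C) concept activations
        logits = self.classifier(concepts)                # (B, K) class logits
        # Per-concept contribution to each class logit:
        # contributions[b, k, c] = concepts[b, c] * W[k, c]
        contributions = concepts.unsqueeze(1) * self.classifier.weight.unsqueeze(0)
        return logits, concepts, contributions

# The decomposition sums back to the logits exactly, so it is faithful by construction.
head = ConceptHead(feat_dim=512, n_concepts=32, n_classes=10)
logits, concepts, contrib = head(torch.randn(4, 512))
assert torch.allclose(logits, contrib.sum(dim=-1), atol=1e-5)
```

Because the per-concept terms sum exactly to the logits, such a decomposition is faithful by construction under this assumed linear-head design.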
📝 Abstract
Deep networks have shown remarkable performance across a wide range of tasks, yet obtaining a global, concept-level understanding of how they function remains a key challenge. Many post-hoc concept-based approaches have been introduced to explain their workings, yet they are not always faithful to the model. Further, they make restrictive assumptions about the concepts a model learns, such as class-specificity, small spatial extent, or alignment with human expectations. In this work, we put the emphasis on the faithfulness of such concept-based explanations and propose a new model with model-inherent, mechanistic concept explanations. Our concepts are shared across classes and, from any layer, their contribution to the logits and their visualization in the input can be faithfully traced. We also leverage foundation models to propose a new concept-consistency metric, C$^2$-Score, that can be used to evaluate concept-based methods. We show that, compared to prior work, our concepts are quantitatively more consistent and that users find them more interpretable, all while retaining competitive ImageNet performance.
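For intuition on what a foundation-model-based consistency metric can look like, below is a small, hypothetical sketch (not the actual C$^2$-Score definition from the paper): embed the image regions where a concept activates most strongly with a foundation-model image encoder, then score the concept by the mean pairwise cosine similarity of those embeddings.

```python
import torch
import torch.nn.functional as F

def concept_consistency(patch_embeddings: torch.Tensor) -> torch.Tensor:
    """Toy consistency score for a single concept (illustrative only; see the
    paper for the actual C^2-Score). `patch_embeddings` holds foundation-model
    embeddings, e.g. from a CLIP image encoder, of the image regions where the
    concept activates most strongly, shape (n, d). Returns the mean pairwise
    cosine similarity: higher means the concept fires on more similar content."""
    z = F.normalize(patch_embeddings, dim=-1)          # unit-norm embeddings
    sim = z @ z.t()                                    # (n, n) cosine similarities
    mask = ~torch.eye(sim.size(0), dtype=torch.bool)   # drop self-similarities
    return sim[mask].mean()

# Random embeddings stand in for real foundation-model features in this example.
print(concept_consistency(torch.randn(16, 512)).item())
```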