🤖 AI Summary
This work addresses the lack of human-interpretable concepts in the intermediate-layer representations of CNNs. We propose an unsupervised post-hoc method that optimizes an orthogonal rotation of the feature space to extract a disentangled, concept-level interpretable basis from sparsely thresholded activation responses. Unlike supervised approaches that rely on manually annotated concept datasets, ours is a purely unsupervised paradigm for discovering highly interpretable bases. We further introduce improved interpretability metrics and a concept-alignment analysis framework, and validate the method across multiple CNN architectures and training datasets. Experiments show that intermediate-layer representations become more interpretable when transformed to the rotated bases, and that in one respect, conceptual breadth, the unsupervised bases surpass those extracted with supervised methods. These results reveal an inherent limitation of supervised paradigms, namely their restricted coverage of concepts, and open a new direction for model interpretability research.
📝 Abstract
An important line of research attempts to explain the predictions and intermediate-layer representations of convolutional neural network (CNN) image classifiers in terms of human-understandable concepts. In this work, we build on prior work that uses annotated concept datasets to extract interpretable feature-space directions, and we propose an unsupervised post-hoc method that extracts a disentangling, interpretable basis by searching for the rotation of the feature space that best explains sparse, one-hot thresholded transformations of pixel activations. We experiment with popular existing CNNs and demonstrate that our method extracts an interpretable basis across network architectures and training datasets. We extend the basis interpretability metrics found in the literature and show that intermediate-layer representations become more interpretable when transformed to the bases extracted with our method. Finally, using these interpretability metrics, we compare the bases extracted with our method against bases derived with a supervised approach, find that in one respect the proposed unsupervised approach has a strength that constitutes a limitation of the supervised one, and suggest directions for future research.
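To make the core idea concrete, here is a minimal, hypothetical sketch of the kind of optimization the abstract describes: learning an orthogonal rotation of a feature space so that soft-thresholded responses in the rotated basis become sparse. All names, the synthetic data, the sigmoid soft threshold, and the sparsity loss are illustrative assumptions for exposition, not the paper's exact objective or implementation.

```python
import torch

torch.manual_seed(0)
d, n = 8, 512

# Synthetic stand-in for intermediate features: sparse non-negative
# "concept" responses mixed by a random orthogonal matrix.
Q, _ = torch.linalg.qr(torch.randn(d, d))
S = torch.relu(torch.randn(n, d) - 1.0)   # sparse sources
Z = S @ Q.T                               # observed (entangled) features

# Unconstrained parameter; the rotation is derived from it below.
A = torch.zeros(d, d, requires_grad=True)
opt = torch.optim.Adam([A], lr=1e-2)

def rotation(A):
    # The matrix exponential of a skew-symmetric matrix is orthogonal,
    # so R remains a valid rotation throughout training.
    return torch.matrix_exp(A - A.T)

for _ in range(200):
    R = rotation(A)
    # Soft one-hot-style thresholding of rotated features (assumed form).
    H = torch.sigmoid((Z @ R.T - 0.5) / 0.1)
    # Sparsity objective: few active thresholded units per sample.
    loss = H.sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

R = rotation(A).detach()
# Orthogonality holds by construction of the parametrization.
orth_err = (R @ R.T - torch.eye(d)).abs().max().item()
```

The skew-symmetric parametrization is one common way to keep the learned basis exactly orthogonal during gradient descent without an explicit projection step.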