🤖 AI Summary
Existing interpretable image classification models struggle to simultaneously achieve high accuracy, robustness, and parameter-free deployment on large-scale datasets. Method: We propose ComFe, a plug-and-play classification head for Vision Transformers (ViTs) that requires no image segmentation, part-level annotations, or additional hyperparameters—only global class labels—to autonomously discover discriminative local component features from pretrained ViT representations. ComFe integrates prototype-based distance metrics with a differentiable clustering mechanism for end-to-end interpretable modeling. Results: On large-scale benchmarks including ImageNet-1K, ComFe achieves classification accuracy on par with black-box ViTs—without fine-tuning or hyperparameter optimization—while significantly outperforming prior interpretable methods. It further demonstrates superior adversarial robustness, cross-dataset consistency, computational efficiency, and scalability.
📝 Abstract
Interpretable computer vision models explain their classifications through comparing the distances between the local embeddings of an image and a set of prototypes that represent the training data. However, these approaches introduce additional hyper-parameters that need to be tuned to apply to new datasets, scale poorly, and are more computationally intensive to train in comparison to black-box approaches. In this work, we introduce Component Features (ComFe), a highly scalable interpretable-by-design image classification head for pretrained Vision Transformers (ViTs) that can obtain competitive performance in comparison to comparable non-interpretable methods. ComFe is the first interpretable head, that we know of, and unlike other interpretable approaches, can be readily applied to large scale datasets such as ImageNet-1K. Additionally, ComFe provides improved robustness and outperforms previous interpretable approaches on key benchmark datasets$unicode{x2013}$using a consistent set of hyper-parameters and without finetuning the pretrained ViT backbone. With only global image labels and no segmentation or part annotations, ComFe can identify consistent component features within an image and determine which of these features are informative in making a prediction. Code is available at https://github.com/emannix/comfe-component-features.