🤖 AI Summary
Existing visual explanation methods (e.g., Grad-CAM) yield coarse object localization in fine-grained recognition, failing to precisely identify the discriminative local regions that distinguish highly similar subclasses (e.g., bird or dog species), thereby limiting model interpretability. To address this, we propose Prompt-CAM—a novel attribution method that requires no modification or retraining of the ViT backbone; instead, it fine-tunes only a lightweight prompt module. Prompt-CAM employs class-specific visual prompts to steer multi-head self-attention toward discriminative image patches, eliminating gradient dependence and directly generating class-conditional attention heatmaps. Evaluated across a dozen fine-grained datasets—including birds, fish, and insects—Prompt-CAM consistently outperforms baselines such as Grad-CAM and INTR, achieving significant improvements in localization accuracy, explanatory fidelity, and cross-dataset generalization.
📝 Abstract
We present a simple usage of pre-trained Vision Transformers (ViTs) for fine-grained analysis, aiming to identify and localize the traits that distinguish visually similar categories, such as different bird species or dog breeds. Pre-trained ViTs such as DINO have shown remarkable capabilities to extract localized, informative features. However, saliency maps like Grad-CAM can hardly point out the traits: they often locate the whole object with a blurred, coarse heatmap, not the traits. We propose a novel approach, Prompt Class Attention Map (Prompt-CAM), to the rescue. Prompt-CAM learns class-specific prompts for a pre-trained ViT and uses the corresponding outputs for classification. To classify an image correctly, the true-class prompt must attend to the unique image patches not seen in other classes' images, i.e., the traits. As such, the true class's multi-head attention maps reveal the traits and their locations. Implementation-wise, Prompt-CAM is almost a free lunch, obtained by simply modifying the prediction head of Visual Prompt Tuning (VPT). This makes Prompt-CAM fairly easy to train and apply, sharply contrasting with other interpretable methods that design specific models and training processes. It is even simpler than the recently published INterpretable TRansformer (INTR), whose encoder-decoder architecture prevents it from leveraging pre-trained ViTs. Extensive empirical studies on a dozen datasets from various domains (e.g., birds, fishes, insects, fungi, flowers, food, and cars) validate Prompt-CAM's superior interpretation capability.
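To make the mechanism concrete, here is a minimal NumPy sketch of the core idea: each class gets its own learnable prompt vector, each prompt attends over the frozen ViT's patch features in one cross-attention step, and the c-th prompt's output alone produces the score for class c; the winning prompt's attention weights, reshaped to the patch grid, serve as the class-specific trait map. All names (`patches`, `prompts`, `w`), the single-head attention, and the shared scoring vector are simplifying assumptions for illustration, not the paper's exact multi-head, multi-layer implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
num_classes, num_patches, dim = 5, 49, 16          # toy sizes; 49 = 7x7 patch grid

patches = rng.normal(size=(num_patches, dim))      # stand-in for frozen ViT patch features
prompts = rng.normal(size=(num_classes, dim))      # learnable class-specific prompts (the only trained part)
w = rng.normal(size=(dim,))                        # hypothetical shared scoring vector

# One cross-attention step: each class prompt attends over the image patches.
attn = softmax(prompts @ patches.T / np.sqrt(dim)) # (C, N) per-class attention maps
out = attn @ prompts_input if False else attn @ patches  # (C, d) prompt outputs
logits = out @ w                                   # class c's score comes from prompt c only

pred = int(np.argmax(logits))
heatmap = attn[pred].reshape(7, 7)                 # predicted class's attention = trait heatmap
```

Because the score for each class flows only through its own prompt, the attention map of the true-class prompt must highlight the discriminative patches, which is exactly what is visualized, with no gradients needed.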