🤖 AI Summary
To address the issue of background activation and insufficient foreground focus in ProtoPNet prototypes on Vision Transformers (ViTs), which degrades interpretability, this paper proposes a dual-branch prototypical architecture: a global branch captures holistic semantics to suppress background interference, while a local branch explicitly learns discriminative visual parts; the two branches collaborate through a prototype coupling mechanism. The paper introduces the first native ViT-based prototypical learning framework, integrating prototype matching with contrastive learning, explicit part-level supervision, and cross-branch feature calibration. Extensive experiments demonstrate that the method consistently outperforms state-of-the-art approaches across multiple benchmarks in classification accuracy, visualization clarity, and quantitative interpretability metrics, including Intersection-over-Union (IoU) and faithfulness. The source code is publicly available.
📝 Abstract
Prototypical part network (ProtoPNet) and its variants have drawn wide attention and been applied to various tasks due to their inherent self-explanatory property. Previous ProtoPNets are primarily built upon convolutional neural networks (CNNs). It is therefore natural to investigate whether these explainable methods can also benefit the recently emerged Vision Transformers (ViTs). However, directly using ViTs as backbones can lead to prototypes paying excessive attention to background positions rather than foreground objects (i.e., the “distraction” problem). To address this problem, this paper proposes the prototypical part Transformer (ProtoPFormer) for interpretable image recognition. Based on the architectural characteristics of ViTs, we modify the original ProtoPNet by creating separate global and local branches, each accompanied by corresponding prototypes that can capture and highlight representative holistic and partial features. Specifically, the global prototypes guide the local prototypes to concentrate on the foreground and effectively suppress the influence of the background. The local prototypes are then explicitly supervised to concentrate on different discriminative visual parts. Finally, the two branches mutually correct each other and jointly make the final decision. Extensive experiments demonstrate that ProtoPFormer consistently achieves superior performance in accuracy, visualization quality, and quantitative interpretability evaluation over state-of-the-art (SOTA) baselines. Our code has been released at https://github.com/zju-vipa/ProtoPFormer.
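The dual-branch idea described in the abstract can be sketched in a few lines. The code below is a minimal, hypothetical illustration (not the authors' implementation): global prototypes are matched against a holistic feature (e.g., the class token), their activations over patch tokens produce a foreground weighting, and local prototypes are matched against the foreground-weighted patch tokens before both branches are combined into one class score. All names, shapes, and the specific masking scheme here are illustrative assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between features a (..., D) and prototypes b (P, D)."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return a @ b.T

def dual_branch_logits(cls_token, patch_tokens, global_protos, local_protos, n_classes):
    """Toy sketch of a global/local prototype decision.

    cls_token:     (B, D)    holistic feature per image
    patch_tokens:  (B, N, D) patch-level features per image
    global_protos: (P, D)    global (holistic) prototypes
    local_protos:  (P, D)    local (part-level) prototypes
    Prototypes are assumed evenly assigned to classes (P = n_classes * per_class).
    """
    P = global_protos.shape[0]
    per_class = P // n_classes

    # Global branch: similarity of the holistic feature to each global prototype.
    g_sim = cosine_sim(cls_token, global_protos)                 # (B, P)

    # Foreground weighting: patches that resemble some global prototype
    # are treated as foreground (softmax over patches).
    patch_g = cosine_sim(patch_tokens, global_protos)            # (B, N, P)
    m = patch_g.max(axis=-1)                                     # (B, N)
    fg = np.exp(m) / np.exp(m).sum(axis=-1, keepdims=True)       # (B, N)

    # Local branch: patch-to-prototype similarities, suppressed on background
    # via the foreground weights, then max-pooled over patches.
    patch_l = cosine_sim(patch_tokens, local_protos)             # (B, N, P)
    l_sim = (patch_l * fg[..., None]).max(axis=1)                # (B, P)

    # Each branch scores a class by summing its per-class prototype similarities;
    # the two branches jointly make the final decision.
    B = cls_token.shape[0]
    g_logits = g_sim.reshape(B, n_classes, per_class).sum(-1)
    l_logits = l_sim.reshape(B, n_classes, per_class).sum(-1)
    return g_logits + l_logits                                   # (B, n_classes)
```

In the actual method the branches are trained end-to-end with additional supervision (e.g., encouraging distinct local prototypes to cover different parts); this sketch only shows the inference-time flow of global guidance into the local branch.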