ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition

📅 2022-08-22
🏛️ International Joint Conference on Artificial Intelligence
📈 Citations: 67
Influential: 4
🤖 AI Summary
To address the "distraction" problem — prototypes in ViT-backed ProtoPNets attending to background positions rather than foreground objects, which degrades interpretability — this paper proposes a dual-branch prototypical architecture: a global branch captures holistic semantics to suppress background interference, while a local branch explicitly learns discriminative visual parts; the two branches are coupled so that global prototypes guide local ones toward the foreground. The paper introduces a ViT-based prototypical learning framework combining prototype matching with explicit part-level supervision and mutual cross-branch correction. Extensive experiments demonstrate that the method consistently outperforms state-of-the-art approaches across multiple benchmarks in classification accuracy, visualization clarity, and quantitative interpretability metrics. The source code is publicly available.
📝 Abstract
Prototypical part network (ProtoPNet) and its variants have drawn wide attention and been applied to various tasks due to their inherent self-explanatory property. Previous ProtoPNets are primarily built upon convolutional neural networks (CNNs). Therefore, it is natural to investigate whether these explainable methods can be advantageous for the recently emerged Vision Transformers (ViTs). However, directly utilizing ViT-backed models as backbones can lead to prototypes paying excessive attention to background positions rather than foreground objects (i.e., the "distraction" problem). To address the problem, this paper proposes prototypical part Transformer (ProtoPFormer) for interpretable image recognition. Based on the architectural characteristics of ViTs, we modify the original ProtoPNet by creating separate global and local branches, each accompanied by corresponding prototypes that can capture and highlight representative holistic and partial features. Specifically, the global prototypes can guide local prototypes to concentrate on the foreground and effectively suppress the background influence. Subsequently, local prototypes are explicitly supervised to concentrate on different discriminative visual parts. Finally, the two branches mutually correct each other and jointly make the final decisions. Moreover, extensive experiments demonstrate that ProtoPFormer can consistently achieve superior performance on accuracy, visualization results, and quantitative interpretability evaluation over the state-of-the-art (SOTA) baselines. Our code has been released at https://github.com/zju-vipa/ProtoPFormer.
Problem

Research questions and friction points this paper is trying to address.

Vision Transformers struggle to focus on foreground prototypical parts
Prototypes are distracted by background features in transformer architectures
Existing methods fail to effectively highlight interpretable visual features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Global prototypes guide local prototypes to foreground
Local prototypes focus on specific visual parts
Mutual correction between global and local prototypes
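The three ideas above can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the paper's implementation: the foreground mask rule, the dot-product similarity, and all shapes are illustrative assumptions, and the actual ProtoPFormer uses the ViT's attention structure and learned prototype layers.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def prototype_activations(tokens, global_protos, local_protos):
    """Sketch of a two-branch prototype scheme (illustrative, not the paper's exact rule).

    tokens:        (N, D) patch-token features from a ViT encoder
    global_protos: (G, D) holistic, global-branch prototypes
    local_protos:  (P, D) part-level, local-branch prototypes
    Returns one activation per local prototype, masked toward the foreground.
    """
    # Global branch: similarity of every token to each global prototype.
    g_sim = tokens @ global_protos.T                  # (N, G)
    # Soft foreground mask: tokens resembling any global prototype count
    # as foreground (an assumed rule for illustration).
    fg_mask = softmax(g_sim.max(axis=1))              # (N,)
    # Local branch: token-to-part-prototype similarities.
    l_sim = tokens @ local_protos.T                   # (N, P)
    # Global prototypes guide the local ones: down-weight background tokens.
    masked = l_sim * fg_mask[:, None]                 # (N, P)
    # Each local prototype fires on its best-matching (foreground) token.
    return masked.max(axis=0)                         # (P,)

rng = np.random.default_rng(0)
acts = prototype_activations(rng.normal(size=(196, 64)),  # 14x14 patch tokens
                             rng.normal(size=(5, 64)),
                             rng.normal(size=(10, 64)))
```

In the full model, these masked part activations would feed a classification head, and the global branch makes its own prediction, so the two branches can correct each other at decision time.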
Mengqi Xue
Zhejiang University, Hangzhou City University
Machine Learning
Qihan Huang
PhD Student, Zhejiang University
Haofei Zhang
Zhejiang University
Jie Song
Zhejiang University, ZJU-Bangsun Joint Research Center
Mingli Song
Hangzhou City University, Zhejiang University, ZJU-Bangsun Joint Research Center
Lechao Cheng
Associate Professor, Hefei University of Technology
Imbalanced Learning, Distillation, Noisy Label Learning, Weakly Supervised Learning, Visual Tuning
Ming-hui Wu