🤖 AI Summary
To address the issue of background activation and insufficient foreground focus in ProtoPNet prototypes on Vision Transformers (ViTs), which degrades interpretability, this paper proposes a dual-branch prototypical architecture: a global branch captures holistic semantics to suppress background interference, while a local branch explicitly learns discriminative visual parts; the two branches collaborate through a prototype coupling mechanism. The paper introduces the first native ViT-based prototypical learning framework, integrating prototype matching with contrastive learning, explicit part-level supervision, and cross-branch feature calibration. Extensive experiments demonstrate that the method consistently outperforms state-of-the-art approaches across multiple benchmarks in classification accuracy, visualization clarity, and quantitative interpretability metrics, including Intersection-over-Union (IoU) and faithfulness. The source code is publicly available.
📝 Abstract
Prototypical part network (ProtoPNet) and its variants have drawn wide attention and been applied to various tasks due to their inherent self-explanatory property. Previous ProtoPNets are primarily built upon convolutional neural networks (CNNs). It is therefore natural to investigate whether these explainable methods can also benefit the recently emerged Vision Transformers (ViTs). However, directly using ViTs as backbones can lead to prototypes paying excessive attention to background positions rather than foreground objects (i.e., the “distraction” problem). To address this problem, this paper proposes the prototypical part Transformer (ProtoPFormer) for interpretable image recognition. Based on the architectural characteristics of ViTs, we modify the original ProtoPNet by creating separate global and local branches, each accompanied by corresponding prototypes that can capture and highlight representative holistic and partial features. Specifically, the global prototypes guide the local prototypes to concentrate on the foreground and effectively suppress the influence of the background. The local prototypes are then explicitly supervised to concentrate on different discriminative visual parts. Finally, the two branches mutually correct each other and jointly make the final decision. Extensive experiments demonstrate that ProtoPFormer consistently achieves superior performance in accuracy, visualization quality, and quantitative interpretability evaluation over state-of-the-art (SOTA) baselines. Our code has been released at https://github.com/zju-vipa/ProtoPFormer.
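The dual-branch idea described in the abstract can be sketched in a few lines. The code below is a minimal, hypothetical illustration (not the authors' implementation): global prototypes are matched against a holistic feature (e.g., the class token), their activations over patch tokens produce a foreground weighting, and local prototypes are matched against the foreground-weighted patch tokens before both branches are combined into one class score. All names, shapes, and the specific masking scheme here are illustrative assumptions.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between features a (..., D) and prototypes b (P, D)."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + 1e-8)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + 1e-8)
    return a @ b.T

def dual_branch_logits(cls_token, patch_tokens, global_protos, local_protos, n_classes):
    """Toy sketch of a global/local prototype decision.

    cls_token:     (B, D)    holistic feature per image
    patch_tokens:  (B, N, D) patch-level features per image
    global_protos: (P, D)    global (holistic) prototypes
    local_protos:  (P, D)    local (part-level) prototypes
    Prototypes are assumed evenly assigned to classes (P = n_classes * per_class).
    """
    P = global_protos.shape[0]
    per_class = P // n_classes

    # Global branch: similarity of the holistic feature to each global prototype.
    g_sim = cosine_sim(cls_token, global_protos)                 # (B, P)

    # Foreground weighting: patches that resemble some global prototype
    # are treated as foreground (softmax over patches).
    patch_g = cosine_sim(patch_tokens, global_protos)            # (B, N, P)
    m = patch_g.max(axis=-1)                                     # (B, N)
    fg = np.exp(m) / np.exp(m).sum(axis=-1, keepdims=True)       # (B, N)

    # Local branch: patch-to-prototype similarities, suppressed on background
    # via the foreground weights, then max-pooled over patches.
    patch_l = cosine_sim(patch_tokens, local_protos)             # (B, N, P)
    l_sim = (patch_l * fg[..., None]).max(axis=1)                # (B, P)

    # Each branch scores a class by summing its per-class prototype similarities;
    # the two branches jointly make the final decision.
    B = cls_token.shape[0]
    g_logits = g_sim.reshape(B, n_classes, per_class).sum(-1)
    l_logits = l_sim.reshape(B, n_classes, per_class).sum(-1)
    return g_logits + l_logits                                   # (B, n_classes)
```

In the actual method the branches are trained end-to-end with additional supervision (e.g., encouraging distinct local prototypes to cover different parts); this sketch only shows the inference-time flow of global guidance into the local branch.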