Discovering Fine-Grained Visual-Concept Relations by Disentangled Optimal Transport Concept Bottleneck Models

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing concept bottleneck models (CBMs) capture only coarse-grained, image-level visual-concept associations, leading to spurious correlations and an inability to localize decision-relevant regions. To address this, we propose the Disentangled Optimal Transport CBM (DOT-CBM), the first CBM to formalize concept reasoning as a fine-grained optimal transport problem between local image patches and semantic concepts. DOT-CBM incorporates visual saliency and concept label statistics as transport priors and introduces an orthogonal projection-based disentanglement loss to mitigate data bias and shortcut learning. As a result, it generates spatially localized, fine-grained concept explanations. Empirically, DOT-CBM achieves state-of-the-art performance on image classification, part localization, and out-of-distribution generalization, demonstrating substantial improvements in both model reliability and interpretability.

📝 Abstract
Concept Bottleneck Models (CBMs) aim to make the decision-making process transparent by introducing an intermediate concept space between the input image and the output prediction. Existing CBMs learn only coarse-grained relations between the whole image and the concepts, paying little attention to local image information, which leads to two main drawbacks: i) they often produce spurious visual-concept relations, decreasing model reliability; and ii) although CBMs can explain the importance of each concept to the final prediction, it remains difficult to tell which visual region produces that prediction. To solve these problems, this paper proposes a Disentangled Optimal Transport CBM (DOT-CBM) framework that explores fine-grained visual-concept relations between local image patches and concepts. Specifically, we model the concept prediction process as a transportation problem between the patches and concepts, thereby achieving explicit fine-grained feature alignment. We also incorporate orthogonal projection losses within each modality to enhance local feature disentanglement. To further address the shortcut issues caused by statistical biases in the data, we utilize the visual saliency map and concept label statistics as transportation priors. Thus, DOT-CBM can visualize inversion heatmaps, provide more reliable concept predictions, and produce more accurate class predictions. Comprehensive experiments demonstrate that our proposed DOT-CBM achieves SOTA performance on several tasks, including image classification, local part detection, and out-of-distribution generalization.
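The transportation problem described above can be sketched with entropy-regularized optimal transport (Sinkhorn iterations). This is a minimal illustration, not the paper's implementation: the patch features, concept embeddings, and the saliency prior `mu` below are toy placeholders, and the paper's actual cost function and prior construction may differ.

```python
import numpy as np

def sinkhorn_transport(cost, mu, nu, eps=0.5, n_iters=500):
    """Entropy-regularized OT between a patch distribution `mu` and a
    concept distribution `nu` (e.g. saliency and label-statistic priors,
    in the spirit of the paper). Returns a patches x concepts plan."""
    K = np.exp(-cost / eps)        # Gibbs kernel of the cost matrix
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.T @ u)         # rescale columns toward marginal nu
        u = mu / (K @ v)           # rescale rows toward marginal mu
    return u[:, None] * K * v[None, :]

# Toy data: 4 local patch features, 3 concept embeddings (hypothetical)
rng = np.random.default_rng(0)
patches = rng.normal(size=(4, 8))
concepts = rng.normal(size=(3, 8))

# Cosine-distance cost between patches and concepts
pn = patches / np.linalg.norm(patches, axis=1, keepdims=True)
cn = concepts / np.linalg.norm(concepts, axis=1, keepdims=True)
cost = 1.0 - pn @ cn.T

mu = np.array([0.4, 0.3, 0.2, 0.1])  # assumed saliency prior over patches
nu = np.ones(3) / 3                  # assumed uniform concept prior
plan = sinkhorn_transport(cost, mu, nu)
```

Each entry of `plan` can be read as the mass a patch sends to a concept, which is what makes the resulting concept predictions spatially localizable.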
Problem

Research questions and friction points this paper is trying to address.

Discover fine-grained visual-concept relations for reliable predictions
Align local image patches with concepts explicitly
Address data biases using transportation priors for accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled Optimal Transport for fine-grained relations
Orthogonal projection losses enhance feature disentanglement
Visual saliency maps as transportation priors
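The orthogonal projection idea above can be illustrated with a common disentanglement penalty: push local feature vectors within a modality toward mutual orthogonality by penalizing off-diagonal entries of their normalized Gram matrix. This is a generic sketch of that family of losses, not the paper's exact formulation.

```python
import numpy as np

def orthogonal_projection_loss(feats):
    """Generic disentanglement penalty (assumed form): mean squared
    off-diagonal cosine similarity between feature vectors. Zero when
    all features are mutually orthogonal."""
    f = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    gram = f @ f.T                       # pairwise cosine similarities
    off_diag = gram - np.eye(len(f))     # zero out the self-similarities
    return float(np.mean(off_diag ** 2))
```

For example, an orthonormal basis yields a loss of 0, while duplicated feature vectors are maximally penalized, which is the behavior a disentanglement term wants.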
👥 Authors
Yan Xie
National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, 710071, China
Zequn Zeng
Xidian University (Vision and language; Deep learning; Visual captioning)
Hao Zhang
National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, 710071, China
Yucheng Ding
Shanghai Jiao Tong University (Device-Cloud ML)
Yi Wang
National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, 710071, China
Zhengjue Wang
Xidian University
Bo Chen
National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, 710071, China
Hongwei Liu
National Key Laboratory of Radar Signal Processing, Xidian University, Xi’an, 710071, China