AI Summary
Existing concept bottleneck models (CBMs) suffer from unfaithful concept prediction and information leakage: predicted concepts often deviate from the image content and carry redundant information, undermining both interpretability and performance. To address this, we propose the Vision-Language-Guided CBM (VLG-CBM), which introduces, for the first time, a vision-grounded concept annotation mechanism that leverages open-vocabulary grounding detectors (e.g., GroundingDINO) to ensure semantic alignment between concepts and visual content. We further design a novel evaluation metric, the Number of Effective Concepts (NEC), to quantify and suppress information leakage at the concept layer, enabling verifiably faithful interpretability. VLG-CBM combines vision-language pre-trained models with NEC-regularized training. On five benchmarks, it achieves absolute improvements of 4.27% to 51.09% in ANEC-5 and 0.45% to 29.78% in ANEC-avg, significantly enhancing concept-image alignment and human interpretability.
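The vision-grounded annotation step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `grounding_detect` is a hypothetical stand-in for a real open-vocabulary detector such as GroundingDINO, and the 0.35 confidence threshold is an assumed value for demonstration.

```python
# Sketch: turn grounding-detector confidences into binary concept labels.
# Assumptions: the detector returns one confidence in [0, 1] per text prompt;
# the threshold 0.35 is illustrative, not taken from the paper.

def grounding_detect(image, prompts):
    """Hypothetical stand-in for an open-vocabulary grounding detector.

    A real detector (e.g. GroundingDINO) would score each text prompt
    against detected regions of the image; here we return stubbed scores.
    """
    fake_scores = {"long tail": 0.82, "red beak": 0.10, "webbed feet": 0.55}
    return [fake_scores.get(p, 0.0) for p in prompts]

def annotate_concepts(image, concept_prompts, threshold=0.35):
    """Binary concept annotation: 1 if the concept is grounded in the image."""
    scores = grounding_detect(image, concept_prompts)
    return [int(score >= threshold) for score in scores]

labels = annotate_concepts(None, ["long tail", "red beak", "webbed feet"])
print(labels)  # [1, 0, 1]
```

The key point is that the concept targets used to train the bottleneck layer come from what the detector actually finds in each image, rather than from class-level text descriptions alone.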
Abstract
Concept Bottleneck Models (CBMs) provide interpretable predictions by introducing an intermediate Concept Bottleneck Layer (CBL), which encodes human-understandable concepts to explain the model's decisions. Recent works propose using Large Language Models and pre-trained Vision-Language Models to automate the training of CBMs, making them more scalable. However, existing approaches still fall short in two aspects. First, the concepts predicted by the CBL often mismatch the input image, raising doubts about the faithfulness of the interpretation. Second, it has been shown that concept values encode unintended information: even a set of random concepts can achieve test accuracy comparable to state-of-the-art CBMs. To address these critical limitations, we propose a novel framework called the Vision-Language-Guided Concept Bottleneck Model (VLG-CBM), which enables faithful interpretability while also boosting performance. Our method leverages off-the-shelf open-domain grounded object detectors to provide visually grounded concept annotations, which largely enhances the faithfulness of concept prediction while further improving model performance. In addition, we propose a new metric, the Number of Effective Concepts (NEC), to control information leakage and provide better interpretability. Extensive evaluations across five standard benchmarks show that VLG-CBM outperforms existing methods by at least 4.27% and up to 51.09% on accuracy at NEC=5 (denoted ANEC-5), and by at least 0.45% and up to 29.78% on average accuracy (denoted ANEC-avg), while preserving both the faithfulness and the interpretability of the learned concepts.
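As a rough sketch of how an NEC-style count might be computed: one natural reading is that NEC measures how many concepts actually carry weight into each class prediction through the final linear layer. The function below implements that reading; the near-zero cutoff `eps` is an assumption for illustration, not necessarily the paper's exact definition.

```python
import numpy as np

def number_of_effective_concepts(W, eps=1e-6):
    """Average number of concepts with non-negligible weight per class.

    W:   (num_classes, num_concepts) weight matrix of the final linear
         layer mapping concept activations to class logits.
    eps: magnitude below which a weight is treated as zero (assumed cutoff).
    """
    effective = np.abs(W) > eps          # boolean mask of "used" concepts
    return effective.sum(axis=1).mean()  # mean count over classes

# Toy example: 2 classes, 4 concepts.
# Class 0 draws on 2 concepts, class 1 on 3, so the average is 2.5.
W = np.array([[0.9, 0.0, -0.4, 0.0],
              [0.2, 0.5, 0.0, -0.1]])
print(number_of_effective_concepts(W))  # 2.5
```

Under this view, enforcing a small NEC (e.g., via a sparse final layer) limits how much incidental information can leak through the concept layer, which is what metrics like ANEC-5 (accuracy when each class may rely on only five effective concepts) then measure.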