🤖 AI Summary
This work addresses a limitation of Concept Bottleneck Models (CBMs): they rely on high-quality concept annotations, which are scarce in practice, while concepts obtained solely from Vision-Language Models (VLMs) are often of lower quality, which undermines interpretability. To overcome this, the authors propose VH-CBM, a hybrid approach that combines VLM-derived information with a small amount of dense human annotations. VH-CBM uses a Gaussian Process in the VLM embedding space to propagate expert supervision to any target data point, and it supports active learning. Experiments show that VH-CBM predicts more accurate and better-calibrated concepts than purely VLM-guided CBMs, even when as little as 1% of the data is annotated.
📝 Abstract
Concept-bottleneck models (CBMs) are neural classifiers that compute predictions from high-level concepts extracted from the input. CBMs ensure stakeholders can understand the concepts, and the predictions they entail, by learning them from concept-level annotations, which, however, are seldom available. Recent CBM architectures work around this issue by obtaining annotations from Vision-Language Models (VLMs). While this greatly broadens applicability, it can yield lower-quality concepts and therefore less interpretable models. We strike a middle ground by introducing Vision-plus-Human-guided CBM (VH-CBM), a hybrid approach that exploits both VLMs and a small amount of dense annotations. VH-CBM employs a Gaussian Process in the VLM's embedding space, which captures useful global information about the target domain, to propagate the expert's supervision to any target data point. Our empirical evaluation shows that VH-CBM predicts more accurate concepts than VLM-guided CBMs even when annotating as little as 1% of the data, while also offering better concept calibration and supporting active learning.
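To make the propagation idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a plain Gaussian Process regression with an RBF kernel over synthetic 8-dimensional vectors standing in for VLM embeddings, a single binary concept encoded as ±1, and a maximum-variance rule for choosing the next point to annotate. The function names (`rbf_kernel`, `gp_posterior`, `next_query`, `run_demo`), the kernel, the hyperparameters, and the acquisition rule are all illustrative assumptions; the abstract does not specify any of them.

```python
import numpy as np


def rbf_kernel(a, b, length_scale):
    """RBF kernel matrix between the rows of a and the rows of b."""
    sq_dist = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def gp_posterior(x_lab, y_lab, x_all, length_scale=2.0, noise=1e-2):
    """GP regression posterior mean and variance at x_all given labelled points."""
    k_ll = rbf_kernel(x_lab, x_lab, length_scale) + noise * np.eye(len(x_lab))
    k_al = rbf_kernel(x_all, x_lab, length_scale)
    chol = np.linalg.cholesky(k_ll)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_lab))
    mean = k_al @ alpha
    v = np.linalg.solve(chol, k_al.T)
    var = 1.0 - np.sum(v**2, axis=0)  # the RBF prior variance is 1
    return mean, np.maximum(var, 0.0)


def next_query(var, labelled):
    """Assumed acquisition rule: pick the unlabelled point with highest variance."""
    masked = var.copy()
    masked[labelled] = -np.inf
    return int(np.argmax(masked))


def run_demo(seed=0):
    rng = np.random.default_rng(seed)
    # Synthetic stand-in for VLM embeddings: two well-separated clusters.
    centers = np.vstack([np.full(8, 1.5), np.full(8, -1.5)])
    cluster = np.repeat([0, 1], 100)
    emb = centers[cluster] + 0.3 * rng.standard_normal((200, 8))
    # Hypothetical concept: present (+1) in cluster 0, absent (-1) in cluster 1.
    truth = np.where(cluster == 0, 1.0, -1.0)
    # Annotate 2 of 200 points (1%), one per cluster.
    labelled = np.array([0, 100])
    mean, var = gp_posterior(emb[labelled], truth[labelled], emb)
    accuracy = float(np.mean(np.sign(mean) == truth))
    return accuracy, next_query(var, labelled), labelled


if __name__ == "__main__":
    acc, query, _ = run_demo()
    print(f"concept accuracy from 1% labels: {acc:.2f}, next query index: {query}")
```

In this toy setting the two labels spread to the remaining points because nearby embeddings have high kernel similarity, and the posterior variance gives both a rough confidence signal and a criterion for which point to annotate next. Real VLM embeddings and a classification likelihood would behave differently, so the sketch should be read only as an illustration of the mechanism.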