Pre-trained Vision-Language Models Learn Discoverable Visual Concepts

📅 2024-04-19
🏛️ arXiv.org
📈 Citations: 4
Influential: 1
🤖 AI Summary
This work investigates whether vision-language models (VLMs) implicitly learn general-purpose visual concepts—such as “spiky” or “brown”—rather than relying on textual shortcuts. To this end, we propose the Concept Discovery and Learning (CDL) framework: it jointly maximizes vision–language mutual information to identify highly discriminative, non-surface-level, and interpretable visual concepts; integrates concept-based prompt engineering with quantitative multi-dataset evaluation; and incorporates human validation. Our study is the first to systematically demonstrate that VLMs inherently acquire broad visual attributes, establishing a novel paradigm for concept definition and evaluation that mitigates text bias while leveraging multimodal discriminability. Evaluated across six recognition benchmarks, the discovered concepts achieve high accuracy, rich semantic content, and strong human interpretability. All code and models are publicly released.
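
As a concrete illustration of extracting concepts through a VLM's vision-language interface, the sketch below scores a few candidate concept prompts against a single image with an off-the-shelf CLIP checkpoint. This is a minimal sketch, not the released CDL code; the checkpoint name, prompt template, concept list, and image path are all illustrative assumptions.

```python
# Minimal sketch (not the authors' released code): probe a pre-trained VLM
# for generic visual concepts via text-based concept prompts.
# Checkpoint, prompt template, concept list, and image path are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

concepts = ["spiky", "brown", "smooth", "striped"]         # candidate generic concepts
prompts = [f"a photo of something {c}" for c in concepts]  # hypothetical template

image = Image.open("durian.jpg")  # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    out = model(**inputs)

# Higher probability = the image is more similar to that concept prompt.
probs = out.logits_per_image.softmax(dim=-1)[0]
for concept, p in zip(concepts, probs.tolist()):
    print(f"{concept}: {p:.3f}")
```

Note that the prompts query generic attributes ("spiky") rather than object-bound phrases ("spiky durian"), mirroring the shortcut-avoidance point made above.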

📝 Abstract
Do vision-language models (VLMs) pre-trained to caption an image of a "durian" learn visual concepts such as "brown" (color) and "spiky" (texture) at the same time? We aim to answer this question as visual concepts learned "for free" would enable wide applications such as neuro-symbolic reasoning or human-interpretable object classification. We assume that the visual concepts, if captured by pre-trained VLMs, can be extracted by their vision-language interface with text-based concept prompts. We observe that recent works prompting VLMs with concepts often differ in their strategies to define and evaluate the visual concepts, leading to conflicting conclusions. We propose a new concept definition strategy based on two observations: First, certain concept prompts include shortcuts that recognize correct concepts for wrong reasons; Second, multimodal information (e.g., visual discriminativeness and textual knowledge) should be leveraged when selecting the concepts. Our proposed concept discovery and learning (CDL) framework is thus designed to identify a diverse list of generic visual concepts (e.g., "spiky" as opposed to "spiky durian"), which are ranked and selected based on visual and language mutual information. We carefully design quantitative and human evaluations of the discovered concepts on six diverse visual recognition datasets, which confirm that pre-trained VLMs do learn visual concepts that provide accurate and thorough descriptions for the recognized objects. All code and models are publicly released.
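
The ranking step described above, selecting concepts by visual and language mutual information, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's estimator: it applies scikit-learn's generic mutual-information estimator to stand-in similarity scores and random class labels, whereas the CDL framework operates on real VLM similarities with its own selection criteria.

```python
# Toy sketch of mutual-information-based concept ranking; the CDL framework's
# actual estimator and data may differ. Scores and labels below are stand-ins.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
concepts = ["spiky", "brown", "smooth", "striped"]

# scores[i, j]: similarity of image i to concept prompt j (e.g., from CLIP);
# random values here stand in for real VLM similarities.
scores = rng.normal(size=(500, len(concepts)))
labels = rng.integers(0, 10, size=500)  # stand-in object-class labels

# Estimate I(concept score; class label) per concept; a higher value means the
# concept is more discriminative, so rank concepts by it in descending order.
mi = mutual_info_classif(scores, labels, random_state=0)
for concept, value in sorted(zip(concepts, mi), key=lambda t: -t[1]):
    print(f"{concept}: MI={value:.3f}")
```

In practice, the score matrix would come from a prompting step like the one sketched under the AI summary, and selection would also draw on textual knowledge, as the abstract notes.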
Problem

Research questions and friction points this paper is trying to address.

Visual Feature Learning
Image Captioning Models
Evaluation Metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Universal Feature Recognition
Resolving Conflicting Conclusions
Comprehensive Description Extraction