🤖 AI Summary
Existing vision-language models generalize poorly in zero-shot image classification due to rigid prompt engineering and poor adaptability to target categories. To address this, we propose a concept-guided Bayesian inference framework inspired by human compositional reasoning: known semantic concepts are leveraged to recognize novel classes. Our approach first models semantic concepts as latent variables within a generative Bayesian model. We then design an importance sampling algorithm tailored for infinite concept spaces, augmented by LLM-generated discriminative concepts. Furthermore, we introduce three dynamic likelihood evaluation strategies that enable test-time adaptive concept fusion. The method synergistically integrates Bayesian modeling, LLM-based prompt generation, and test-time adaptation (TTA). Evaluated across 15 benchmarks, our method consistently outperforms state-of-the-art approaches, achieving significant accuracy gains in cross-domain, fine-grained, and long-tailed recognition tasks.
📝 Abstract
In zero-shot image recognition tasks, humans demonstrate remarkable flexibility in classifying unseen categories by composing known, simpler concepts. However, existing vision-language models (VLMs), despite achieving significant progress through large-scale natural language supervision, often underperform in real-world applications because of sub-optimal prompt engineering and an inability to adapt effectively to target classes. To address these issues, we propose a Concept-guided Human-like Bayesian Reasoning (CHBR) framework. Grounded in Bayes' theorem, CHBR models the concepts humans use in image recognition as latent variables and formulates classification as a sum over potential concepts, weighted by a prior distribution and a likelihood function. To tackle the intractable computation over an infinite concept space, we introduce an importance sampling algorithm that iteratively prompts large language models (LLMs) to generate discriminative concepts emphasizing inter-class differences. We further propose three heuristic likelihood strategies, namely Average Likelihood, Confidence Likelihood, and Test-Time Augmentation (TTA) Likelihood, which dynamically refine how concepts are combined for each test image. Extensive evaluations across fifteen datasets demonstrate that CHBR consistently outperforms existing state-of-the-art zero-shot generalization methods.
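The abstract does not give the exact formulation, but the concept-weighted sum it describes can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the uniform prior, and the toy similarity scores are all assumptions, and `sim[c][y]` stands in for a VLM similarity score (e.g. a CLIP logit) between the test image and a prompt pairing concept `c` with class `y`.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def bayesian_concept_fusion(sim, prior=None):
    """Approximate p(y | x) by summing over sampled concepts.

    sim[c][y]: similarity between the image and a prompt that pairs
               concept c with class y (hypothetical scores here).
    prior:     importance weights for the sampled concepts; a uniform
               prior reduces the sum to a plain average over concepts,
               akin to an Average Likelihood strategy.
    """
    n_concepts = len(sim)
    n_classes = len(sim[0])
    if prior is None:
        prior = [1.0 / n_concepts] * n_concepts
    posterior = [0.0] * n_classes
    for c in range(n_concepts):
        # Per-concept class likelihood from the similarity scores.
        lik = softmax(sim[c])
        for y in range(n_classes):
            posterior[y] += prior[c] * lik[y]
    return posterior

# Toy usage: 2 sampled concepts, 3 candidate classes.
sim = [[2.0, 0.5, 0.1],
       [1.5, 0.2, 0.0]]
post = bayesian_concept_fusion(sim)
```

A confidence-weighted variant would replace the fixed prior with per-image weights (e.g. favoring concepts whose likelihood is sharply peaked), which is the kind of test-time refinement the three likelihood strategies above target.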