Beyond Heuristic Prompting: A Concept-Guided Bayesian Framework for Zero-Shot Image Recognition

📅 2026-03-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of current vision-language models in zero-shot image recognition, which are hindered by heuristic prompt engineering and sensitivity to anomalous prompts. The authors reformulate the task from a Bayesian perspective, modeling class concepts as latent variables and performing prediction via marginalization over the concept space. To enhance robustness, they introduce a training-free adaptive soft-truncation likelihood mechanism. Their approach leverages large language models to generate diverse multi-stage concepts and employs determinantal point processes (DPPs) to promote concept diversity. The method is supported by a theoretical excess risk bound and achieves state-of-the-art performance across multiple benchmarks, significantly improving zero-shot recognition accuracy.
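The Bayesian reformulation described above can be sketched as follows. This is a minimal illustration with made-up similarity scores; the temperature `tau`, the uniform concept prior, and the exponentiated-similarity likelihood are assumptions for the sketch, not the paper's exact choices:

```python
import numpy as np

# Hypothetical setup: 3 classes, each with 4 LLM-generated concepts.
# sims[k, j] = similarity between the test image embedding and the
# j-th concept prompt of class k, as a CLIP-style VLM would score it.
rng = np.random.default_rng(0)
sims = rng.uniform(0.1, 0.4, size=(3, 4))

prior = np.full(4, 1.0 / 4)  # uniform concept prior p(c | y) (assumed)
tau = 0.01                   # softmax temperature (assumed)

# Prediction via marginalization over the concept space:
# p(y | x) ∝ sum_c p(c | y) * p(x | c)
likelihood = np.exp(sims / tau)              # test-image conditioned likelihood
class_scores = (likelihood * prior).sum(axis=1)
pred = int(np.argmax(class_scores))
```

Each class score is a prior-weighted sum over its concepts, so no single concept fully determines the prediction.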

📝 Abstract
Vision-Language Models (VLMs), such as CLIP, have significantly advanced zero-shot image recognition. However, their performance remains limited by suboptimal prompt engineering and poor adaptability to target classes. While recent methods attempt to improve prompts through diverse class descriptions, they often rely on heuristic designs, lack versatility, and are vulnerable to outlier prompts. This paper enhances prompts by incorporating class-specific concepts. By treating concepts as latent variables, we rethink zero-shot image classification from a Bayesian perspective, casting prediction as marginalization over the concept space, where each concept is weighted by a prior and a test-image-conditioned likelihood. This formulation underscores the importance of both a well-structured concept proposal distribution and the refinement of concept priors. To construct an expressive and efficient proposal distribution, we introduce a multi-stage concept synthesis pipeline driven by LLMs to generate discriminative and compositional concepts, followed by a Determinantal Point Process to enforce diversity. To mitigate the influence of outlier concepts, we propose a training-free, adaptive soft-trim likelihood, which attenuates their impact in a single forward pass. We further provide robustness guarantees and derive multi-class excess risk bounds for our framework. Extensive experiments demonstrate that our method consistently outperforms state-of-the-art approaches, validating its effectiveness in zero-shot image classification. Our code is available at https://github.com/less-and-less-bugs/CGBC.
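The abstract's DPP step for enforcing concept diversity can be illustrated with standard greedy MAP inference over a similarity kernel. This is a generic sketch, not the paper's implementation; the toy concept embeddings and the linear kernel are assumptions:

```python
import numpy as np

def greedy_dpp(kernel: np.ndarray, k: int) -> list[int]:
    """Greedy MAP inference for a DPP: repeatedly add the candidate that
    maximizes the log-determinant of the selected kernel submatrix,
    which penalizes items similar to those already chosen."""
    selected: list[int] = []
    for _ in range(k):
        best, best_gain = -1, -np.inf
        for i in range(kernel.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_gain:  # skip degenerate subsets
                best, best_gain = i, logdet
        if best < 0:
            break
        selected.append(best)
    return selected

# Toy candidate concept embeddings; items 0 and 1 are near-duplicates.
emb = np.array([[1.0, 0.0, 0.0], [0.99, 0.14, 0.0], [0.0, 1.0, 0.0],
                [0.0, 0.0, 1.0], [0.6, 0.6, 0.5]])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
L = emb @ emb.T + 1e-6 * np.eye(5)  # PSD similarity kernel
chosen = greedy_dpp(L, k=3)
```

The determinant shrinks when two selected embeddings are nearly parallel, so the near-duplicate pair (items 0 and 1) is never selected together.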
Problem

Research questions and friction points this paper is trying to address.

zero-shot image recognition
prompt engineering
vision-language models
outlier prompts
class adaptability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Concept-Guided Bayesian Framework
Zero-Shot Image Recognition
Large Language Models (LLMs)
Determinantal Point Process
Adaptive Soft-Trim Likelihood
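The adaptive soft-trim likelihood listed above attenuates outlier concepts without training. The paper's exact form is not reproduced here; a plausible sketch compresses scores that sit far from a robust per-class center (median, scaled by MAD), where `alpha` and the tanh compression are assumptions:

```python
import numpy as np

def soft_trim(scores: np.ndarray, alpha: float = 2.0) -> np.ndarray:
    """Softly truncate concept scores far from the class's robust center,
    so a single anomalous prompt cannot dominate the class likelihood."""
    med = np.median(scores)
    mad = np.median(np.abs(scores - med)) + 1e-8  # robust scale
    z = (scores - med) / mad
    # tanh compresses large |z| smoothly instead of hard-clipping
    return med + mad * alpha * np.tanh(z / alpha)

# Four concept scores for one class; the last comes from an outlier prompt.
raw = np.array([0.30, 0.31, 0.29, 0.90])
trimmed = soft_trim(raw)
```

Scores near the median pass through almost unchanged, while the outlier is pulled close to the bulk of the distribution; everything runs in a single forward pass, consistent with the training-free claim.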