🤖 AI Summary
Existing vision-language models generalize poorly in zero-shot image classification due to rigid prompt engineering and poor adaptability to target categories. To address this, we propose a concept-guided Bayesian inference framework inspired by human compositional reasoning: known semantic concepts are leveraged to recognize novel classes. Our approach first models semantic concepts as latent variables within a generative Bayesian model. We then design an importance sampling algorithm tailored for infinite concept spaces, augmented by LLM-generated discriminative concepts. Furthermore, we introduce three dynamic likelihood evaluation strategies that enable test-time adaptive concept fusion. The method synergistically integrates Bayesian modeling, LLM-based prompt generation, and test-time adaptation (TTA). Evaluated across 15 benchmarks, our method consistently outperforms state-of-the-art approaches, achieving significant accuracy gains in cross-domain, fine-grained, and long-tailed recognition tasks.
📝 Abstract
In zero-shot image recognition tasks, humans demonstrate remarkable flexibility in classifying unseen categories by composing known, simpler concepts. However, existing vision-language models (VLMs), despite achieving significant progress through large-scale natural language supervision, often underperform in real-world applications because of sub-optimal prompt engineering and an inability to adapt effectively to target classes. To address these issues, we propose a Concept-guided Human-like Bayesian Reasoning (CHBR) framework. Grounded in Bayes' theorem, CHBR models the concepts humans use in image recognition as latent variables and formulates classification as a sum over potential concepts, weighted by a prior distribution and a likelihood function. To tackle the intractable computation over an infinite concept space, we introduce an importance sampling algorithm that iteratively prompts large language models (LLMs) to generate discriminative concepts emphasizing inter-class differences. We further propose three heuristic likelihood strategies, namely Average Likelihood, Confidence Likelihood, and Test-Time Augmentation (TTA) Likelihood, which dynamically refine how concepts are combined for each test image. Extensive evaluations across fifteen datasets demonstrate that CHBR consistently outperforms existing state-of-the-art zero-shot generalization methods.
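The abstract does not give the exact formulation, but the concept-weighted sum it describes can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the uniform prior, and the toy similarity scores are all assumptions, and `sim[c][y]` stands in for a VLM similarity score (e.g. a CLIP logit) between the test image and a prompt pairing concept `c` with class `y`.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def bayesian_concept_fusion(sim, prior=None):
    """Approximate p(y | x) by summing over sampled concepts.

    sim[c][y]: similarity between the image and a prompt that pairs
               concept c with class y (hypothetical scores here).
    prior:     importance weights for the sampled concepts; a uniform
               prior reduces the sum to a plain average over concepts,
               akin to an Average Likelihood strategy.
    """
    n_concepts = len(sim)
    n_classes = len(sim[0])
    if prior is None:
        prior = [1.0 / n_concepts] * n_concepts
    posterior = [0.0] * n_classes
    for c in range(n_concepts):
        # Per-concept class likelihood from the similarity scores.
        lik = softmax(sim[c])
        for y in range(n_classes):
            posterior[y] += prior[c] * lik[y]
    return posterior

# Toy usage: 2 sampled concepts, 3 candidate classes.
sim = [[2.0, 0.5, 0.1],
       [1.5, 0.2, 0.0]]
post = bayesian_concept_fusion(sim)
```

A confidence-weighted variant would replace the fixed prior with per-image weights (e.g. favoring concepts whose likelihood is sharply peaked), which is the kind of test-time refinement the three likelihood strategies above target.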