Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) suffer performance degradation in zero-shot transfer due to semantic misalignment between pretraining objectives and downstream tasks. Method: We propose a constrained prompt enhancement framework comprising: (1) generating synonym-rich semantic sets via large language models and constructing topology-guided enriched text prompts using semantic ambiguity entropy and persistent homology analysis; (2) localizing discriminative visual regions via pretrained-model activation maps to suppress background noise; and (3) introducing test-time-adaptive, optimal-transport-driven set-to-set matching for fine-grained cross-modal alignment. Contributions/Results: The method improves zero-shot classification accuracy across multiple benchmarks, mitigating both incomplete textual prompts and noisy visual prompts, and enhances cross-dataset generalization while preserving modality-specific discriminability and semantic coherence.

📝 Abstract
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment due to domain gaps between pre-training and downstream tasks. Existing approaches primarily focus on text prompting with class-specific descriptions and visual-text adaptation via aligning cropped image regions with textual descriptions. However, they still face the issues of incomplete textual prompts and noisy visual prompts. In this paper, we propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment by constructing comprehensive textual prompts and compact visual prompts from the semantic perspective. Specifically, our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS). Textually, to address the issue of incomplete semantic expression in textual prompts, our TGSSG first generates a synonymous semantic set for each category via large language models, and constructs comprehensive textual prompts based on semantic ambiguity entropy and persistent homology analysis. Visually, to mitigate the irrelevant visual noise introduced by random cropping, our CADRS identifies discriminative regions with activation maps output by a pre-trained vision model, effectively filtering out noisy regions and generating compact visual prompts. Given the comprehensive set of textual prompts and compact set of visual prompts, we introduce two set-to-set matching strategies based on test-time adaptation (TTA) and optimal transport (OT) to achieve effective visual-textual alignment, thereby improving zero-shot generalization of VLMs.
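One plausible reading of "semantic ambiguity entropy" is the entropy of a synonym prompt's softmax similarity over all class prototypes: a prompt whose embedding commits clearly to one class carries low entropy. The sketch below follows that reading; the paper's exact definition (and its persistent-homology filtering) may differ, so treat every function here as an assumption.

```python
import numpy as np

def ambiguity_entropy(text_emb, class_protos):
    """Entropy of the softmax similarity of one prompt embedding to all
    class prototypes; lower entropy = less ambiguous prompt. A hedged
    interpretation of the paper's 'semantic ambiguity entropy'."""
    sims = class_protos @ text_emb
    p = np.exp(sims - sims.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def filter_prompts(prompt_embs, class_protos, keep=2):
    """Keep the `keep` least-ambiguous synonym prompts for a category."""
    scores = [ambiguity_entropy(e, class_protos) for e in prompt_embs]
    order = np.argsort(scores)[:keep]
    return [prompt_embs[i] for i in order]
```

For example, with one-hot class prototypes, a prompt embedding pointing firmly at a single class scores lower entropy than a uniform embedding, so `filter_prompts` drops the uniform one first.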
Problem

Research questions and friction points this paper is trying to address.

Addressing semantic misalignment in vision-language models due to domain gaps
Mitigating incomplete textual prompts and noisy visual prompts
Improving zero-shot generalization through comprehensive visual-textual alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates comprehensive textual prompts via LLMs
Selects discriminative visual regions using activation maps
Uses set-to-set matching with TTA and OT
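The OT-based matching in the last bullet can be sketched with generic Sinkhorn iterations between the two prompt sets. This is an illustrative stand-in, not the paper's implementation: uniform marginals, the regularization strength `eps`, and the transport-weighted similarity score are all assumptions.

```python
import numpy as np

def sinkhorn(cost, n_iter=50, eps=0.1):
    """Entropic-regularized OT plan between uniform marginals via
    standard Sinkhorn iterations (a generic OT solver, not CPE's)."""
    n, m = cost.shape
    K = np.exp(-cost / eps)
    a, b = np.ones(n) / n, np.ones(m) / m
    v = np.ones(m) / m
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def set_to_set_score(visual, textual):
    """OT alignment score between a set of visual prompt features and a
    set of textual prompt features (rows are L2-normalized)."""
    visual = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    textual = textual / np.linalg.norm(textual, axis=1, keepdims=True)
    cost = 1.0 - visual @ textual.T            # cosine distance
    plan = sinkhorn(cost)
    return float((plan * (1.0 - cost)).sum())  # transport-weighted similarity
```

Intuitively, the transport plan lets each visual prompt distribute its mass over the textual prompts it best matches, so two identical sets score near 1 while a degenerate textual set scores lower.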