Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models

📅 2025-08-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) suffer performance degradation in zero-shot transfer due to semantic misalignment between pretraining objectives and downstream tasks. Method: We propose a constrained prompt enhancement framework comprising: (1) generating synonym-rich semantic sets via large language models and constructing topology-guided enriched text prompts using semantic ambiguity entropy and persistent homology analysis; (2) localizing discriminative visual regions via pretrained-model activation maps to suppress background noise; and (3) introducing test-time-adaptive, optimal-transport-driven set-to-set matching for fine-grained cross-modal alignment. Contributions/Results: The method improves zero-shot classification accuracy across multiple benchmarks, mitigating both incomplete textual prompts and noisy visual prompts, and enhances cross-dataset generalization while preserving modality-specific discriminability and semantic coherence.

📝 Abstract
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment due to domain gaps between pre-training and downstream tasks. Existing approaches primarily focus on text prompting with class-specific descriptions and visual-text adaptation via aligning cropped image regions with textual descriptions. However, they still face the issues of incomplete textual prompts and noisy visual prompts. In this paper, we propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment by constructing comprehensive textual prompts and compact visual prompts from the semantic perspective. Specifically, our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS). Textually, to address the issue of incomplete semantic expression in textual prompts, our TGSSG first generates a synonymous semantic set for each category via large language models, and constructs comprehensive textual prompts based on semantic ambiguity entropy and persistent homology analysis. Visually, to mitigate the irrelevant visual noise introduced by random cropping, our CADRS identifies discriminative regions with activation maps output by a pre-trained vision model, effectively filtering out noisy regions and generating compact visual prompts. Given the comprehensive set of textual prompts and compact set of visual prompts, we introduce two set-to-set matching strategies based on test-time adaptation (TTA) and optimal transport (OT) to achieve effective visual-textual alignment, thereby improving zero-shot generalization of VLMs.
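One plausible reading of "semantic ambiguity entropy" is the entropy of a synonym prompt's softmax similarity over all class prototypes: a prompt whose embedding commits clearly to one class carries low entropy. The sketch below follows that reading; the paper's exact definition (and its persistent-homology filtering) may differ, so treat every function here as an assumption.

```python
import numpy as np

def ambiguity_entropy(text_emb, class_protos):
    """Entropy of the softmax similarity of one prompt embedding to all
    class prototypes; lower entropy = less ambiguous prompt. A hedged
    interpretation of the paper's 'semantic ambiguity entropy'."""
    sims = class_protos @ text_emb
    p = np.exp(sims - sims.max())
    p /= p.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

def filter_prompts(prompt_embs, class_protos, keep=2):
    """Keep the `keep` least-ambiguous synonym prompts for a category."""
    scores = [ambiguity_entropy(e, class_protos) for e in prompt_embs]
    order = np.argsort(scores)[:keep]
    return [prompt_embs[i] for i in order]
```

For example, with one-hot class prototypes, a prompt embedding pointing firmly at a single class scores lower entropy than a uniform embedding, so `filter_prompts` drops the uniform one first.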
Problem

Research questions and friction points this paper is trying to address.

Addressing semantic misalignment in vision-language models due to domain gaps
Mitigating incomplete textual prompts and noisy visual prompts
Improving zero-shot generalization through comprehensive visual-textual alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates comprehensive textual prompts via LLMs
Selects discriminative visual regions using activation maps
Uses set-to-set matching with TTA and OT
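The OT-based matching in the last bullet can be sketched with generic Sinkhorn iterations between the two prompt sets. This is an illustrative stand-in, not the paper's implementation: uniform marginals, the regularization strength `eps`, and the transport-weighted similarity score are all assumptions.

```python
import numpy as np

def sinkhorn(cost, n_iter=50, eps=0.1):
    """Entropic-regularized OT plan between uniform marginals via
    standard Sinkhorn iterations (a generic OT solver, not CPE's)."""
    n, m = cost.shape
    K = np.exp(-cost / eps)
    a, b = np.ones(n) / n, np.ones(m) / m
    v = np.ones(m) / m
    for _ in range(n_iter):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def set_to_set_score(visual, textual):
    """OT alignment score between a set of visual prompt features and a
    set of textual prompt features (rows are L2-normalized)."""
    visual = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    textual = textual / np.linalg.norm(textual, axis=1, keepdims=True)
    cost = 1.0 - visual @ textual.T            # cosine distance
    plan = sinkhorn(cost)
    return float((plan * (1.0 - cost)).sum())  # transport-weighted similarity
```

Intuitively, the transport plan lets each visual prompt distribute its mass over the textual prompts it best matches, so two identical sets score near 1 while a degenerate textual set scores lower.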