🤖 AI Summary
To address the limitations of CLIP-style models—namely, limited semantic expressiveness and constrained cross-modal alignment caused by reliance on a single text prompt—this paper proposes a context-adaptive multi-prompt embedding method. The approach introduces structured prompt templates augmented with learnable context tokens, encoding multiple prompts jointly to yield more discriminative text representations. It further incorporates a diversity regularization loss and a negation-aware contrastive loss to explicitly enhance semantic richness and inter-class separability. Crucially, multi-prompt encoding, joint forward propagation, and end-to-end optimization are unified within a single framework. Extensive experiments demonstrate consistent gains on both image–text and video–text retrieval benchmarks, validating the method's effectiveness at improving fine-grained semantic alignment across modalities.
📝 Abstract
We propose Context-Adaptive Multi-Prompt Embedding, a novel approach to enrich semantic representations in vision-language contrastive learning. Unlike standard CLIP-style models that rely on a single text embedding, our method introduces multiple structured prompts, each containing a distinct adaptive token that captures diverse semantic aspects of the input text. We process all prompts jointly in a single forward pass. The resulting prompt embeddings are combined into a unified text representation, enabling semantically richer alignment with visual features. To further promote semantic diversity and representation quality, we incorporate a diversity regularization loss and a negation-aware loss, encouraging specialization across prompts and improving contrastive discrimination. Our method achieves consistent improvements on both image-text and video-text retrieval benchmarks.
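The core pipeline described above—encoding K prompt variants per caption, pooling them into one text representation, penalizing redundancy across prompts, and training contrastively against visual features—can be sketched as follows. This is a minimal numpy illustration under stated assumptions: the function names, mean pooling, and the pairwise-cosine diversity penalty are hypothetical simplifications, not the paper's actual implementation, and the negation-aware loss is omitted.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def multi_prompt_text_embedding(prompt_embs):
    """Pool K per-prompt embeddings (K, B, D) into one unified
    text representation per sample (B, D). Mean pooling is an
    illustrative assumption; the paper may combine them differently."""
    return l2_normalize(prompt_embs.mean(axis=0))

def diversity_loss(prompt_embs):
    """Encourage specialization: penalize the average pairwise cosine
    similarity between the K prompt embeddings of each sample."""
    K = prompt_embs.shape[0]
    p = l2_normalize(prompt_embs)                    # (K, B, D)
    sims = np.einsum('kbd,lbd->klb', p, p)           # (K, K, B) pairwise cosines
    off_diag = ~np.eye(K, dtype=bool)                # exclude self-similarity
    return sims[off_diag].mean()

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE over an image/text batch, CLIP-style:
    matched pairs sit on the diagonal of the similarity matrix."""
    logits = l2_normalize(img_feats) @ l2_normalize(txt_feats).T / temperature
    labels = np.arange(len(img_feats))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)         # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

In this sketch the total training objective would be the contrastive loss plus a weighted `diversity_loss` term; identical prompts drive the diversity penalty toward 1, while mutually orthogonal prompt embeddings drive it toward 0.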