CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) suffer from semantic imbalance in text-driven image retrieval: a few dominant visual tokens over-represent global semantics, suppressing discriminative local features. To address this, we propose CalibCLIP—a training-free framework that achieves cross-modal semantic alignment via joint calibration of visual and textual embedding spaces. Its core innovations include a Contrastive Visual Enhancer, which dynamically suppresses dominant visual tokens, and a Discriminative Concept Calibrator, which strengthens fine-grained semantic cues from text—both enabled by feature disentanglement and dynamic representation suppression. Crucially, CalibCLIP operates without any fine-tuning. Evaluated across three retrieval paradigms and seven standard benchmarks, it delivers consistent performance gains, significantly improving model discriminability and generalization.

📝 Abstract
Existing Visual Language Models (VLMs) suffer structural limitations where a few low-contribution tokens may excessively capture global semantics, dominating the information aggregation process and suppressing the discriminative features in text-driven image retrieval tasks. To address this, we introduce CalibCLIP, a training-free method designed to calibrate the suppressive effect of dominant tokens. Specifically, in the visual space, we propose the Contrastive Visual Enhancer (CVE), which decouples visual features into target and low-information regions. Subsequently, it identifies dominant tokens and dynamically suppresses their representations. In the textual space, we introduce the Discriminative Concept Calibrator (DCC), which aims to differentiate between general and discriminative concepts within the text query. By mitigating the challenges posed by generic concepts and improving the representations of discriminative concepts, DCC strengthens the differentiation among similar samples. Finally, extensive experiments demonstrate consistent improvements across seven benchmarks spanning three image retrieval tasks, underscoring the effectiveness of CalibCLIP. Code is available at: https://github.com/kangbin98/CalibCLIP
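
To make the CVE idea concrete, below is a minimal PyTorch sketch of dominant-token suppression over CLIP-style patch tokens. The similarity-based scoring rule, the `top_frac` cutoff, the damping weight `alpha`, and the function name are all illustrative assumptions; the paper's actual CVE additionally decouples target and low-information regions before suppression.

```python
import torch
import torch.nn.functional as F

def suppress_dominant_tokens(patch_tokens: torch.Tensor,
                             cls_token: torch.Tensor,
                             alpha: float = 0.3,
                             top_frac: float = 0.05) -> torch.Tensor:
    """patch_tokens: (B, N, D) patch embeddings; cls_token: (B, D) global token.
    NOTE: a hypothetical helper, not the paper's exact CVE formulation."""
    # Score each patch by similarity to the global representation: tokens
    # that dominate information aggregation score highest.
    scores = F.cosine_similarity(patch_tokens, cls_token.unsqueeze(1), dim=-1)  # (B, N)
    k = max(1, int(top_frac * patch_tokens.size(1)))
    top_idx = scores.topk(k, dim=1).indices  # (B, k) dominant-token indices
    # Downweight dominant tokens rather than discarding them outright.
    weights = torch.ones_like(scores)
    weights.scatter_(1, top_idx, alpha)
    calibrated = patch_tokens * weights.unsqueeze(-1)
    # Re-pool a calibrated global image feature for retrieval scoring,
    # e.g. logits = image_feat @ text_feat.T
    return F.normalize(calibrated.mean(dim=1), dim=-1)
```

Because the suppression only reweights pooled features at inference time, no gradients or fine-tuning are involved, matching the training-free setting the paper describes.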
Problem

Research questions and friction points this paper is trying to address.

Addresses the suppression of discriminative features by dominant tokens in text-driven image retrieval
Calibrates visual and textual representations to enhance discriminative features
Improves retrieval accuracy across multiple benchmarks without requiring training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples visual features into target and low information regions
Dynamically suppresses dominant token representations in visual space
Differentiates general and discriminative concepts within text queries (see the sketch after this list)
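
On the textual side, a comparable hedged sketch: treat distance from a generic prompt embedding as a proxy for how discriminative a word is, then re-aggregate the query with discriminative concepts upweighted. The proxy, the `temperature`, and the helper name are assumptions for illustration, not the paper's DCC formulation.

```python
import torch
import torch.nn.functional as F

def calibrate_text_query(token_embs: torch.Tensor,
                         generic_emb: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """token_embs: (T, D) per-word embeddings of the query;
    generic_emb: (D,) embedding of a generic template such as
    'a photo of an object'. A hypothetical helper, not the paper's DCC."""
    # Words near the generic template (e.g. "a", "photo") carry little
    # discriminative signal; words far from it separate similar candidates.
    generic_sim = F.cosine_similarity(token_embs, generic_emb.unsqueeze(0), dim=-1)  # (T,)
    weights = torch.softmax((1.0 - generic_sim) / temperature, dim=0)
    # Re-aggregate the query with discriminative concepts upweighted.
    return F.normalize((weights.unsqueeze(-1) * token_embs).sum(dim=0), dim=-1)
```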
Bin Kang
Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu, China; University of Chinese Academy of Sciences, Beijing, China
Bin Chen
International Research Institute for Artificial Intelligence, Harbin Institute of Technology (Shenzhen), Shenzhen, China
Junjie Wang
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Yulin Li
The Hong Kong University of Science and Technology
Optimization Theory · Robot Motion Planning & Control
Junzhi Zhao
Southwest Jiaotong University, Chengdu, China
Zhuotao Tian
Professor, Harbin Institute of Technology (Shenzhen)
Vision-language Model · Multi-modal Perception · Computer Vision