CalibCLIP: Contextual Calibration of Dominant Semantics for Text-Driven Image Retrieval

📅 2025-10-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language models (VLMs) suffer from semantic imbalance in text-driven image retrieval: a few dominant visual tokens over-represent global semantics, suppressing discriminative local features. To address this, we propose CalibCLIP—a training-free framework that achieves cross-modal semantic alignment via joint calibration of visual and textual embedding spaces. Its core innovations include a Contrastive Visual Enhancer, which dynamically suppresses dominant visual tokens, and a Discriminative Concept Calibrator, which strengthens fine-grained semantic cues from text—both enabled by feature disentanglement and dynamic representation suppression. Crucially, CalibCLIP operates without any fine-tuning. Evaluated across three retrieval paradigms and seven standard benchmarks, it delivers consistent performance gains, significantly improving model discriminability and generalization.

📝 Abstract
Existing Visual Language Models (VLMs) suffer structural limitations where a few low-contribution tokens may excessively capture global semantics, dominating the information aggregation process and suppressing the discriminative features in text-driven image retrieval tasks. To address this, we introduce CalibCLIP, a training-free method designed to calibrate the suppressive effect of dominant tokens. Specifically, in the visual space, we propose the Contrastive Visual Enhancer (CVE), which decouples visual features into target and low-information regions. Subsequently, it identifies dominant tokens and dynamically suppresses their representations. In the textual space, we introduce the Discriminative Concept Calibrator (DCC), which aims to differentiate between general and discriminative concepts within the text query. By mitigating the challenges posed by generic concepts and improving the representations of discriminative concepts, DCC strengthens the differentiation among similar samples. Finally, extensive experiments demonstrate consistent improvements across seven benchmarks spanning three image retrieval tasks, underscoring the effectiveness of CalibCLIP. Code is available at: https://github.com/kangbin98/CalibCLIP
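
To make the CVE idea concrete, below is a minimal PyTorch sketch of dominant-token suppression over CLIP-style patch tokens. The similarity-based scoring rule, the `top_frac` cutoff, the damping weight `alpha`, and the function name are all illustrative assumptions; the paper's actual CVE additionally decouples target and low-information regions before suppression.

```python
import torch
import torch.nn.functional as F

def suppress_dominant_tokens(patch_tokens: torch.Tensor,
                             cls_token: torch.Tensor,
                             alpha: float = 0.3,
                             top_frac: float = 0.05) -> torch.Tensor:
    """patch_tokens: (B, N, D) patch embeddings; cls_token: (B, D) global token.
    NOTE: a hypothetical helper, not the paper's exact CVE formulation."""
    # Score each patch by similarity to the global representation: tokens
    # that dominate information aggregation score highest.
    scores = F.cosine_similarity(patch_tokens, cls_token.unsqueeze(1), dim=-1)  # (B, N)
    k = max(1, int(top_frac * patch_tokens.size(1)))
    top_idx = scores.topk(k, dim=1).indices  # (B, k) dominant-token indices
    # Downweight dominant tokens rather than discarding them outright.
    weights = torch.ones_like(scores)
    weights.scatter_(1, top_idx, alpha)
    calibrated = patch_tokens * weights.unsqueeze(-1)
    # Re-pool a calibrated global image feature for retrieval scoring,
    # e.g. logits = image_feat @ text_feat.T
    return F.normalize(calibrated.mean(dim=1), dim=-1)
```

Because the suppression only reweights pooled features at inference time, no gradients or fine-tuning are involved, matching the training-free setting the paper describes.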
Problem

Research questions and friction points this paper is trying to address.

Addresses the suppression of discriminative features by dominant tokens in text-driven image retrieval
Calibrates visual and textual representations to enhance discriminative features
Improves retrieval accuracy across multiple benchmarks without requiring training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decouples visual features into target and low information regions
Dynamically suppresses dominant token representations in visual space
Differentiates general and discriminative concepts within text queries (see the sketch after this list)
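
On the textual side, a comparable hedged sketch: treat distance from a generic prompt embedding as a proxy for how discriminative a word is, then re-aggregate the query with discriminative concepts upweighted. The proxy, the `temperature`, and the helper name are assumptions for illustration, not the paper's DCC formulation.

```python
import torch
import torch.nn.functional as F

def calibrate_text_query(token_embs: torch.Tensor,
                         generic_emb: torch.Tensor,
                         temperature: float = 0.1) -> torch.Tensor:
    """token_embs: (T, D) per-word embeddings of the query;
    generic_emb: (D,) embedding of a generic template such as
    'a photo of an object'. A hypothetical helper, not the paper's DCC."""
    # Words near the generic template (e.g. "a", "photo") carry little
    # discriminative signal; words far from it separate similar candidates.
    generic_sim = F.cosine_similarity(token_embs, generic_emb.unsqueeze(0), dim=-1)  # (T,)
    weights = torch.softmax((1.0 - generic_sim) / temperature, dim=0)
    # Re-aggregate the query with discriminative concepts upweighted.
    return F.normalize((weights.unsqueeze(-1) * token_embs).sum(dim=0), dim=-1)
```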
Bin Kang
Chengdu Institute of Computer Applications, Chinese Academy of Sciences, Chengdu, China; University of Chinese Academy of Sciences, Beijing, China
Bin Chen
International Research Institute for Artificial Intelligence, Harbin Institute of Technology (Shenzhen), Shenzhen, China
Junjie Wang
Harbin Institute of Technology (Shenzhen), Shenzhen, China
Yulin Li
The Hong Kong University of Science and Technology
Optimization Theory · Robot Motion Planning & Control
Junzhi Zhao
Southwest Jiaotong University, Chengdu, China
Zhuotao Tian
Professor, Harbin Institute of Technology (Shenzhen)
Vision-language Model · Multi-modal Perception · Computer Vision