EvdCLIP: Improving Vision-Language Retrieval with Entity Visual Descriptions from Large Language Models

📅 2025-04-11
🏛️ AAAI Conference on Artificial Intelligence
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address erroneous retrievals in vision-language retrieval caused by missing entity-level visual semantics, this paper proposes Entity Visual Description enhanced CLIP (EvdCLIP). First, a large language model (LLM) generates fine-grained Entity Visual Descriptions (EVDs) that serve as novel, explicit cues for CLIP's cross-modal alignment. Second, an end-to-end trainable EVD-aware Rewriter (EaRW) rewrites queries to incorporate this visual knowledge while remaining robust to noisy or low-quality expansions; training jointly optimizes contrastive and generative objectives. On Flickr30K and COCO, the method achieves an average +2.7% improvement in Recall@1 over prior state-of-the-art methods. Crucially, EVDs introduce the first explicit, interpretable, entity-level visual priors into CLIP, enabling both visual consistency and semantic interpretability in rewritten queries.
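The summary's first step, enriching a query with entity visual descriptions before encoding, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `EVD_CACHE` lookup and the word-matching entity extractor stand in for the paper's LLM-generated descriptions and its actual entity recognition.

```python
# Hypothetical sketch of EVD-enhanced query construction. A static
# entity -> description table stands in for the paper's LLM call.
EVD_CACHE = {
    "hummingbird": "a tiny bird with iridescent feathers and a long thin beak",
    "kayak": "a narrow, closed-deck paddle boat",
}

def extract_entities(query: str) -> list[str]:
    # Toy entity extraction: keep words that have a cached description.
    # The paper would use an LLM or entity linker here (assumption).
    return [w.strip(".,").lower() for w in query.split()
            if w.strip(".,").lower() in EVD_CACHE]

def enhance_query(query: str) -> str:
    # Append each entity's visual description as an explicit alignment cue.
    cues = [f"{e} ({EVD_CACHE[e]})" for e in extract_entities(query)]
    return query if not cues else query + " | visual cues: " + "; ".join(cues)

print(enhance_query("A hummingbird hovering near a flower"))
```

In the full system, the rewriter (EaRW) would then compress this enriched string into a fluent query rather than concatenating cues verbatim.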

📝 Abstract
Vision-language retrieval (VLR), which uses text (or images) as queries to retrieve the corresponding images (or text), has attracted significant attention in both academia and industry. However, existing methods often neglect the rich visual semantic knowledge of entities, leading to incorrect retrieval results. To address this problem, we propose Entity Visual Description enhanced CLIP (EvdCLIP), designed to leverage the visual knowledge of entities to enrich queries. Specifically, since humans recognize entities through visual cues, we employ a large language model (LLM) to generate Entity Visual Descriptions (EVDs) as alignment cues that complement textual data. These EVDs are then integrated into raw queries to create visually rich, EVD-enhanced queries. Furthermore, recognizing that EVD-enhanced queries may introduce noise or low-quality expansions, we develop a novel, trainable EVD-aware Rewriter (EaRW) for vision-language retrieval tasks. EaRW uses EVD knowledge and the generative capabilities of the language model to rewrite queries effectively. With our specialized training strategy, EaRW generates high-quality, low-noise EVD-enhanced queries. Extensive quantitative and qualitative experiments on image-text retrieval benchmarks validate the superiority of EvdCLIP on vision-language retrieval tasks.
Problem

Research questions and friction points this paper is trying to address.

Enhancing vision-language retrieval with entity visual descriptions
Addressing incorrect retrieval due to neglected visual semantics
Reducing noise in queries with trainable EVD-aware Rewriter
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLM for Entity Visual Descriptions
Integrates EVDs to enrich raw queries
Uses EVD-aware Rewriter to reduce noise
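The summary mentions jointly optimizing contrastive and generative objectives. A toy sketch of such a joint loss, assuming a standard CLIP-style symmetric InfoNCE term plus a rewriter negative log-likelihood; the `alpha` balancing weight and the plain-Python similarity matrix are illustrative assumptions, not values from the paper:

```python
import math

def info_nce(sim_matrix, temperature=0.07):
    # Symmetric InfoNCE over a similarity matrix whose diagonal holds
    # the matched image-text pairs (a standard CLIP-style objective).
    n = len(sim_matrix)
    def row_loss(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]  # cross-entropy against position i
        return total / n
    cols = [list(c) for c in zip(*sim_matrix)]  # text-to-image direction
    return 0.5 * (row_loss(sim_matrix) + row_loss(cols))

def joint_loss(sim_matrix, rewriter_nll, alpha=0.5):
    # Weighted sum of the retrieval (contrastive) and query-rewriting
    # (generative) terms; alpha is a hypothetical balancing weight.
    return info_nce(sim_matrix) + alpha * rewriter_nll

sims = [[0.9, 0.1], [0.2, 0.8]]  # toy image-text cosine similarities
print(joint_loss(sims, rewriter_nll=1.2))
```

With well-separated diagonal similarities as above, the contrastive term is near zero and the generative term dominates, which is the intended behavior once retrieval pairs are confidently matched.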
G. Meng
Tsinghua Shenzhen International Graduate School, Tsinghua University
Sunan He
Hong Kong University of Science and Technology
Multi-Modal Learning
Jinpeng Wang
Tsinghua Shenzhen International Graduate School, Tsinghua University
Tao Dai
Shenzhen University
Image Restoration · Computer Vision · Deep Learning
Letian Zhang
Middle Tennessee State University
Mobile/IoT System Design · Edge Intelligence · Network Security
Jieming Zhu
Huawei Noah’s Ark Lab
Qing Li
Peng Cheng Laboratory
Gang Wang
Huawei Noah’s Ark Lab
Rui Zhang
School of Computer Science & Tech, Huazhong University of Science and Technology
Yong Jiang
Tsinghua Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory