Language-driven Fine-grained Retrieval

📅 2025-12-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing fine-grained image retrieval (FGIR) methods rely on coarse category-level one-hot labels, yielding sparse semantic supervision that fails to capture cross-category comparability of fine-grained details, thereby limiting generalization to unseen categories. To address this, we propose LaFG—a novel framework that synergistically leverages a large language model (LLM) and a frozen vision-language model (VLM) to construct attribute-level semantic prototypes. Specifically, LaFG first extracts fine-grained attributes from category names via semantic anchor mining; then employs the VLM to align textual attribute descriptions with visual features, building a global attribute lexicon; finally, it generates category-specific linguistic prototypes as dense, attribute-aware supervision signals. This enables explicit modeling of local discriminative details. Evaluated on multiple FGIR benchmarks, LaFG consistently outperforms state-of-the-art methods, achieving substantial gains—particularly in zero-shot cross-category retrieval and generalization to unseen categories.
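The three-step pipeline in the summary (attribute mining from category names → vision-aligned clustering into a global attribute lexicon → category-specific linguistic prototypes) can be sketched roughly as follows. This is a minimal illustration under loud assumptions, not the paper's implementation: a deterministic toy embedder stands in for the frozen VLM text encoder, plain k-means stands in for the clustering step, and the attribute descriptions are invented examples of what an LLM might return.

```python
import hashlib
import numpy as np

def toy_embed(texts, dim=64):
    """Stand-in for the frozen VLM text encoder: deterministic
    pseudo-random unit vectors derived from each description."""
    vecs = []
    for t in texts:
        seed = int.from_bytes(hashlib.md5(t.encode()).digest()[:4], "little")
        rng = np.random.default_rng(seed)
        v = rng.standard_normal(dim)
        vecs.append(v / np.linalg.norm(v))
    return np.stack(vecs)

def kmeans(X, k, iters=25, seed=0):
    """Plain k-means; stands in for whatever clustering the paper uses."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

# Hypothetical LLM-generated attribute descriptions per category name.
descriptions = {
    "cardinal": ["red plumage", "crested head", "conical orange beak"],
    "blue jay": ["blue plumage", "crested head", "black collar"],
}
all_desc = [d for ds in descriptions.values() for d in ds]
emb = toy_embed(all_desc)
vocab, assign = kmeans(emb, k=4)   # dataset-wide attribute vocabulary

# Category prototype = normalized mean of the vocabulary entries
# selected by that category's attributes.
protos, i = {}, 0
for name, ds in descriptions.items():
    idx = assign[i:i + len(ds)]
    i += len(ds)
    p = vocab[idx].mean(axis=0)
    protos[name] = p / np.linalg.norm(p)
```

Because shared attributes ("crested head") land in the same cluster, prototypes of different categories share vocabulary entries, which is what gives the supervision its cross-category comparability.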

📝 Abstract
Existing fine-grained image retrieval (FGIR) methods learn discriminative embeddings by adopting semantically sparse one-hot labels derived from category names as supervision. While effective on seen classes, such supervision overlooks the rich semantics encoded in category names, hindering the modeling of comparability among cross-category details and, in turn, limiting generalization to unseen categories. To tackle this, we introduce LaFG, a Language-driven framework for Fine-Grained Retrieval that converts class names into attribute-level supervision using large language models (LLMs) and vision-language models (VLMs). Treating each name as a semantic anchor, LaFG prompts an LLM to generate detailed, attribute-oriented descriptions. To mitigate attribute omission in these descriptions, it leverages a frozen VLM to project them into a vision-aligned space, clustering them into a dataset-wide attribute vocabulary while harvesting complementary attributes from related categories. Leveraging this vocabulary, a global prompt template selects category-relevant attributes, which are aggregated into category-specific linguistic prototypes. These prototypes supervise the retrieval model to steer …
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations of one-hot labels in fine-grained image retrieval
Enhances generalization to unseen categories via attribute-level supervision
Leverages LLMs and VLMs to create detailed linguistic prototypes for retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM generates attribute descriptions from class names
VLM clusters descriptions into vision-aligned attribute vocabulary
Linguistic prototypes supervise retrieval model for generalization
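The last bullet, prototypes supervising the retrieval model, amounts to replacing one-hot targets with alignment to the linguistic prototypes. Below is a hedged sketch of one plausible form of such a loss; the softmax-over-cosine formulation and the temperature value are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def prototype_loss(img_emb, protos, labels, tau=0.07):
    """Cross-entropy over cosine similarities between image embeddings
    and linguistic prototypes (an assumed loss form, not the paper's).
    img_emb: (B, D) image embeddings; protos: (C, D) category
    prototypes; labels: (B,) ground-truth category indices."""
    z = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    p = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    logits = z @ p.T / tau                          # (B, C) scaled cosines
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

When each embedding coincides with its own prototype the loss is near zero, and mismatched labels drive it up, which is the behaviour a prototype-alignment objective needs.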
Shijie Wang
The University of Queensland, Australia

Xin Yu
The University of Queensland, Australia

Yadan Luo
ARC DECRA and Senior Lecturer, University of Queensland
Generalization · 3D Vision · Autonomous Driving

Zijian Wang
The University of Queensland, Australia

Pengfei Zhang
The University of Queensland, Australia

Zi Huang
PhD Candidate
Deep Learning