🤖 AI Summary
Existing fine-grained image retrieval (FGIR) methods rely on coarse category-level one-hot labels, a sparse form of semantic supervision that fails to capture the cross-category comparability of fine-grained details and thus limits generalization to unseen categories. To address this, we propose LaFG, a novel framework that combines a large language model (LLM) with a frozen vision-language model (VLM) to construct attribute-level semantic prototypes. Specifically, LaFG first extracts fine-grained attributes from category names via semantic anchor mining; it then uses the VLM to align textual attribute descriptions with visual features, building a global attribute lexicon; finally, it aggregates category-relevant attributes into category-specific linguistic prototypes that serve as dense, attribute-aware supervision signals, enabling explicit modeling of local discriminative details. Evaluated on multiple FGIR benchmarks, LaFG consistently outperforms state-of-the-art methods, with especially large gains in zero-shot cross-category retrieval and generalization to unseen categories.
📝 Abstract
Existing fine-grained image retrieval (FGIR) methods learn discriminative embeddings by adopting semantically sparse one-hot labels derived from category names as supervision. While effective on seen classes, such supervision overlooks the rich semantics encoded in category names, hindering the modeling of comparability among cross-category details and, in turn, limiting generalization to unseen categories. To tackle this, we introduce LaFG, a Language-driven framework for Fine-Grained Retrieval that converts class names into attribute-level supervision using large language models (LLMs) and vision-language models (VLMs). Treating each name as a semantic anchor, LaFG prompts an LLM to generate detailed, attribute-oriented descriptions. To mitigate attribute omission in these descriptions, it leverages a frozen VLM to project them into a vision-aligned space, clustering them into a dataset-wide attribute vocabulary while harvesting complementary attributes from related categories. Leveraging this vocabulary, a global prompt template selects category-relevant attributes, which are aggregated into category-specific linguistic prototypes. These prototypes supervise the retrieval model to steer
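The pipeline the abstract describes (LLM-generated attribute descriptions → vision-aligned embeddings → a clustered, dataset-wide attribute vocabulary → averaged category prototypes) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the attribute strings are toy stand-ins for LLM output, `embed` is a hash-seeded stand-in for a frozen VLM text encoder (e.g. CLIP), and plain k-means stands in for whatever clustering LaFG actually uses.

```python
import hashlib
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for LLM-generated, attribute-oriented descriptions
# keyed by category name (the paper prompts an LLM with each name).
attr_descriptions = {
    "cardinal": ["red plumage", "crested head", "short conical beak"],
    "blue jay": ["blue plumage", "crested head", "white underparts"],
    "sparrow":  ["brown plumage", "streaked breast", "short conical beak"],
}

def embed(text: str, dim: int = 16) -> np.ndarray:
    """Stand-in for a frozen VLM text encoder: a deterministic,
    hash-seeded random unit vector. NOT a real vision-aligned space."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    v = np.random.default_rng(seed).normal(size=dim)
    return v / np.linalg.norm(v)

# 1. Project every attribute description into the (toy) embedding space.
texts = [t for attrs in attr_descriptions.values() for t in attrs]
X = np.stack([embed(t) for t in texts])

# 2. Cluster into a dataset-wide attribute vocabulary (naive k-means).
def kmeans(X: np.ndarray, k: int, iters: int = 50):
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():          # skip empty clusters
                centers[j] = X[labels == j].mean(0)
    return centers, labels

vocab, labels = kmeans(X, k=4)

# 3. For each category, select its relevant vocabulary entries (the cluster
#    of each of its attributes) and average them into a linguistic prototype.
prototypes = {}
for cat, attrs in attr_descriptions.items():
    cluster_ids = sorted({int(labels[texts.index(t)]) for t in attrs})
    proto = vocab[cluster_ids].mean(0)
    prototypes[cat] = proto / np.linalg.norm(proto)
```

Because shared attributes such as "crested head" land in the same vocabulary cluster for every category that mentions them, prototypes of related categories share components, which is the cross-category comparability the dense supervision is meant to provide.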