FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Composed Image Retrieval (CIR) methods rely on coarse-grained modification text (CoarseMT), which limits their ability to model fine-grained retrieval intent, resulting in ambiguous positive samples, severe visual ambiguity, and low retrieval accuracy. To address this, we propose FineCIR, the first CIR framework that explicitly parses the fine-grained semantics of modification text. We introduce a fine-grained CIR annotation protocol and release two benchmark datasets: Fine-FashionIQ and Fine-CIRR. FineCIR integrates a multimodal alignment mechanism with semantic disentanglement, jointly optimizing a text parsing module and a visual entity disambiguation module. Extensive experiments demonstrate that FineCIR achieves state-of-the-art performance on both fine-grained and standard CIR benchmarks, significantly improving retrieval accuracy. All code and datasets are publicly available.

📝 Abstract
Composed Image Retrieval (CIR) facilitates image retrieval through a multimodal query consisting of a reference image and modification text. The reference image defines the retrieval context, while the modification text specifies desired alterations. However, existing CIR datasets predominantly employ coarse-grained modification text (CoarseMT), which inadequately captures fine-grained retrieval intents. This limitation introduces two key challenges: (1) ignoring detailed differences leads to imprecise positive samples, and (2) greater ambiguity arises when retrieving visually similar images. These issues degrade retrieval accuracy, necessitating manual result filtering or repeated queries. To address these limitations, we develop a robust fine-grained CIR data annotation pipeline that minimizes imprecise positive samples and enhances CIR systems' ability to discern modification intents accurately. Using this pipeline, we refine the FashionIQ and CIRR datasets to create two fine-grained CIR datasets: Fine-FashionIQ and Fine-CIRR. Furthermore, we introduce FineCIR, the first CIR framework explicitly designed to parse the modification text. FineCIR effectively captures fine-grained modification semantics and aligns them with ambiguous visual entities, enhancing retrieval precision. Extensive experiments demonstrate that FineCIR consistently outperforms state-of-the-art CIR baselines on both fine-grained and traditional CIR benchmark datasets. Our FineCIR code and fine-grained CIR datasets are available at https://github.com/SDU-L/FineCIR.git.
Problem

Research questions and friction points this paper is trying to address.

Addresses imprecise image retrieval from coarse modification texts
Enhances discernment of fine-grained modification intents in CIR
Reduces ambiguity when retrieving visually similar images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops fine-grained CIR data annotation pipeline
Introduces FineCIR framework for parsing modification text
Aligns fine-grained semantics with ambiguous visual entities
Zixu Li
School of Software, Shandong University
Zhiheng Fu
School of Software, Shandong University
Yupeng Hu
School of Software, Shandong University
Zhiwei Chen
School of Software, Shandong University
Haokun Wen
Harbin Institute of Technology, Shenzhen
Multimedia Computing · Information Retrieval
Liqiang Nie
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)