FineCIR: Explicit Parsing of Fine-Grained Modification Semantics for Composed Image Retrieval

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Composed Image Retrieval (CIR) methods rely on coarse-grained modification text (CoarseMT), which limits their ability to model fine-grained retrieval intent, resulting in ambiguous positive samples, severe visual ambiguity, and low retrieval accuracy. To address this, we propose FineCIR, the first CIR framework that explicitly parses the fine-grained semantics of modification text. We introduce a fine-grained CIR annotation protocol and release two benchmark datasets: Fine-FashionIQ and Fine-CIRR. FineCIR integrates a multimodal alignment mechanism with semantic disentanglement, jointly optimizing a text parsing module and a visual entity disambiguation module. Extensive experiments demonstrate that FineCIR achieves state-of-the-art performance on both fine-grained and standard CIR benchmarks, significantly improving retrieval accuracy. All code and datasets are publicly available.

📝 Abstract
Composed Image Retrieval (CIR) facilitates image retrieval through a multimodal query consisting of a reference image and modification text. The reference image defines the retrieval context, while the modification text specifies desired alterations. However, existing CIR datasets predominantly employ coarse-grained modification text (CoarseMT), which inadequately captures fine-grained retrieval intents. This limitation introduces two key challenges: (1) ignoring detailed differences leads to imprecise positive samples, and (2) greater ambiguity arises when retrieving visually similar images. These issues degrade retrieval accuracy, necessitating manual result filtering or repeated queries. To address these limitations, we develop a robust fine-grained CIR data annotation pipeline that minimizes imprecise positive samples and enhances CIR systems' ability to discern modification intents accurately. Using this pipeline, we refine the FashionIQ and CIRR datasets to create two fine-grained CIR datasets: Fine-FashionIQ and Fine-CIRR. Furthermore, we introduce FineCIR, the first CIR framework explicitly designed to parse the modification text. FineCIR effectively captures fine-grained modification semantics and aligns them with ambiguous visual entities, enhancing retrieval precision. Extensive experiments demonstrate that FineCIR consistently outperforms state-of-the-art CIR baselines on both fine-grained and traditional CIR benchmark datasets. Our FineCIR code and fine-grained CIR datasets are available at https://github.com/SDU-L/FineCIR.git.
Problem

Research questions and friction points this paper is trying to address.

Addresses imprecise image retrieval from coarse modification texts
Enhances discernment of fine-grained modification intents in CIR
Reduces ambiguity when retrieving visually similar images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops fine-grained CIR data annotation pipeline
Introduces FineCIR framework for parsing modification text
Aligns fine-grained semantics with ambiguous visual entities
Zixu Li
School of Software, Shandong University
Zhiheng Fu
School of Software, Shandong University
Yupeng Hu
School of Software, Shandong University
Zhiwei Chen
School of Software, Shandong University
Haokun Wen
Harbin Institute of Technology, Shenzhen
Multimedia Computing · Information Retrieval
Liqiang Nie
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)