AI Summary
Existing composed image retrieval (CIR) methods rely on simple modification texts and struggle to handle the complex, multi-edit requirements of real-world scenarios, often suffering from insufficient entity coverage and misalignment between textual clauses and visual entities. To address these limitations, this work proposes TEMA, the first CIR framework designed for multiple textual edits while still accommodating simple ones. TEMA leverages a text-guided entity mapping mechanism, integrating multimodal alignment with fine-grained semantic parsing to enable joint image anchoring and text-driven retrieval. Additionally, the authors introduce two instruction-rich, multi-edit datasets, M-FashionIQ and M-CIRR, to better reflect practical use cases. Extensive experiments on four benchmark datasets show that TEMA improves retrieval accuracy in both single- and multi-edit settings while maintaining a favorable balance between accuracy and computational efficiency.
Abstract
Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and a modification text. Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. To address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. In addition, we propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both original and multi-modification scenarios, while maintaining an optimal balance between retrieval accuracy and computational efficiency. Our code and constructed multi-modification datasets (M-FashionIQ and M-CIRR) are available at https://github.com/lee-zixu/ACL26-TEMA/.