XR: Cross-Modal Agents for Composed Image Retrieval

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of traditional embedding-based composed image retrieval methods, which struggle to integrate cross-modal semantic and visual cues and lack the capacity to reason about complex text-image queries. To overcome these challenges, the authors propose XR, a training-free multi-agent framework that reframes the retrieval process as progressive collaborative reasoning. Three specialized agents (Imagination, Similarity, and Question-Answering) sequentially synthesize target representations, perform coarse candidate filtering, and verify factual consistency. XR is the first approach to introduce a multi-agent mechanism into composed image retrieval, achieving performance gains of up to 38% over strong baselines on FashionIQ, CIRR, and CIRCO. Ablation studies confirm that each agent is essential, marking a clear departure from and improvement upon conventional embedding paradigms.

📝 Abstract
Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential. Code is available: https://01yzzyu.github.io/xr.github.io/.
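The three-stage pipeline the abstract describes (imagine the target, coarsely filter candidates by similarity, then verify the survivors with targeted question answering) can be sketched as a minimal, training-free loop. Everything below is a hypothetical illustration: the agent bodies are simple string heuristics standing in for the generative, embedding, and QA models that XR's actual agents would call.

```python
from dataclasses import dataclass

@dataclass
class Query:
    reference_image: str    # e.g. an image ID or caption of the reference image
    modification_text: str  # e.g. "same dress but in red"

def imagination_agent(query: Query) -> str:
    # Synthesize a representation of the intended target
    # (stands in for cross-modal generation from image + text).
    return f"{query.reference_image} modified so that: {query.modification_text}"

def similarity_agent(target_repr: str, gallery: list[str], k: int) -> list[str]:
    # Coarse filtering: rank gallery items by a toy word-overlap score
    # (stands in for hybrid embedding-based matching).
    target_words = set(target_repr.lower().split())
    def score(item: str) -> int:
        return len(target_words & set(item.lower().split()))
    return sorted(gallery, key=score, reverse=True)[:k]

def question_agent(target_repr: str, candidates: list[str]) -> list[str]:
    # Fine filtering: keep candidates that pass a factual-consistency check
    # (stands in for asking targeted questions about each candidate).
    keyword = target_repr.lower().split()[-1]  # crude proxy for a QA probe
    passed = [c for c in candidates if keyword in c.lower()]
    return passed or candidates  # fall back rather than return nothing

def retrieve(query: Query, gallery: list[str], k: int = 3) -> list[str]:
    # Progressive coordination: imagine -> coarse filter -> verify.
    target = imagination_agent(query)
    coarse = similarity_agent(target, gallery, k)
    return question_agent(target, coarse)
```

For example, `retrieve(Query("dress", "make it red"), ["blue dress", "red dress", "red shoes", "green hat"])` ranks "red dress" first: the similarity stage keeps the dress- and red-related items, and the verification stage drops candidates that fail the consistency check. The staged design is the point of the sketch: each later agent only sees what the earlier one passed through, mirroring the coarse-to-fine refinement the paper describes.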
Problem

Research questions and friction points this paper is trying to address.

Composed Image Retrieval
cross-modal reasoning
multimodal understanding
semantic reasoning
image-text retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Agents
Composed Image Retrieval
Training-Free Framework
Multi-Agent Reasoning
Progressive Coordination