XR: Cross-Modal Agents for Composed Image Retrieval

📅 2026-01-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of traditional embedding-based composed image retrieval methods, which struggle to integrate cross-modal semantic and visual cues and lack the capacity to reason about complex text-image queries. To overcome these challenges, the authors propose XR, a training-free multi-agent framework that reframes the retrieval process as progressive collaborative reasoning. Three specialized agents (Imagination, Similarity, and Question-Answering) sequentially synthesize target representations, perform coarse candidate filtering, and verify factual consistency. XR is the first approach to introduce a multi-agent mechanism into composed image retrieval, achieving performance gains of up to 38% over strong baselines on FashionIQ, CIRR, and CIRCO. Ablation studies confirm that each agent is essential, marking a clear departure from and improvement upon conventional embedding paradigms.

📝 Abstract
Retrieval is being redefined by agentic AI, demanding multimodal reasoning beyond conventional similarity-based paradigms. Composed Image Retrieval (CIR) exemplifies this shift as each query combines a reference image with textual modifications, requiring compositional understanding across modalities. While embedding-based CIR methods have achieved progress, they remain narrow in perspective, capturing limited cross-modal cues and lacking semantic reasoning. To address these limitations, we introduce XR, a training-free multi-agent framework that reframes retrieval as a progressively coordinated reasoning process. It orchestrates three specialized types of agents: imagination agents synthesize target representations through cross-modal generation, similarity agents perform coarse filtering via hybrid matching, and question agents verify factual consistency through targeted reasoning for fine filtering. Through progressive multi-agent coordination, XR iteratively refines retrieval to meet both semantic and visual query constraints, achieving up to a 38% gain over strong training-free and training-based baselines on FashionIQ, CIRR, and CIRCO, while ablations show each agent is essential. Code is available: https://01yzzyu.github.io/xr.github.io/.
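The three-stage pipeline the abstract describes (imagine the target, coarsely filter candidates by similarity, then verify the survivors with targeted question answering) can be sketched as a minimal, training-free loop. Everything below is a hypothetical illustration: the agent bodies are simple string heuristics standing in for the generative, embedding, and QA models that XR's actual agents would call.

```python
from dataclasses import dataclass

@dataclass
class Query:
    reference_image: str    # e.g. an image ID or caption of the reference image
    modification_text: str  # e.g. "same dress but in red"

def imagination_agent(query: Query) -> str:
    # Synthesize a representation of the intended target
    # (stands in for cross-modal generation from image + text).
    return f"{query.reference_image} modified so that: {query.modification_text}"

def similarity_agent(target_repr: str, gallery: list[str], k: int) -> list[str]:
    # Coarse filtering: rank gallery items by a toy word-overlap score
    # (stands in for hybrid embedding-based matching).
    target_words = set(target_repr.lower().split())
    def score(item: str) -> int:
        return len(target_words & set(item.lower().split()))
    return sorted(gallery, key=score, reverse=True)[:k]

def question_agent(target_repr: str, candidates: list[str]) -> list[str]:
    # Fine filtering: keep candidates that pass a factual-consistency check
    # (stands in for asking targeted questions about each candidate).
    keyword = target_repr.lower().split()[-1]  # crude proxy for a QA probe
    passed = [c for c in candidates if keyword in c.lower()]
    return passed or candidates  # fall back rather than return nothing

def retrieve(query: Query, gallery: list[str], k: int = 3) -> list[str]:
    # Progressive coordination: imagine -> coarse filter -> verify.
    target = imagination_agent(query)
    coarse = similarity_agent(target, gallery, k)
    return question_agent(target, coarse)
```

For example, `retrieve(Query("dress", "make it red"), ["blue dress", "red dress", "red shoes", "green hat"])` ranks "red dress" first: the similarity stage keeps the dress- and red-related items, and the verification stage drops candidates that fail the consistency check. The staged design is the point of the sketch: each later agent only sees what the earlier one passed through, mirroring the coarse-to-fine refinement the paper describes.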
Problem

Research questions and friction points this paper is trying to address.

Composed Image Retrieval
cross-modal reasoning
multimodal understanding
semantic reasoning
image-text retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Modal Agents
Composed Image Retrieval
Training-Free Framework
Multi-Agent Reasoning
Progressive Coordination