Do Composed Image Retrieval Benchmarks Require Multimodal Composition?

📅 2026-05-14

🤖 AI Summary

This work investigates whether existing Composed Image Retrieval (CIR) benchmarks genuinely test multimodal composition or can be solved through unimodal shortcuts. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, 32.2% to 83.6% of queries can be solved using a single modality. The authors audit the benchmarks in two stages: cross-model analysis first identifies shortcut-solvable queries, and human validation of the 4,741 shortcut-free queries then finds that only 1,689 are well-formed, with ambiguous edits and mismatched targets among the common problems. On this validated subset, queries can no longer be solved with a single modality: accuracy decreases while reliance on multimodal information increases. The results indicate that current benchmarks overestimate model capability in multimodal composition.

📝 Abstract

Composed Image Retrieval (CIR) is a multimodal retrieval task where a query consists of a reference image and a textual modification, and the goal is to retrieve a target image satisfying both. In principle, strong performance on CIR benchmarks is assumed to require multimodal composition, i.e., combining complementary information from reference image and textual modification. In this work, we show that this assumption does not always hold. Across four widely used CIR benchmarks and eleven Generalist Multimodal Embedding models, a large fraction of queries can be solved using a single modality (from 32.2% to 83.6%), revealing pervasive unimodal shortcuts. Thus, high CIR performance can arise from unimodal signals rather than true multimodal composition. To better understand this issue, we perform a two-stage audit. First, we identify shortcut-solvable queries through cross-model analysis. Second, we conduct human validation on 4,741 shortcut-free queries, of which only 1,689 are well-formed, with common issues including ambiguous edits and mismatched targets. Re-evaluating models on this validated subset reveals qualitatively different behaviour: queries can no longer be solved with a single modality, and successful retrieval requires combining both inputs. While accuracy decreases, reliance on multimodal information increases. Overall, current CIR benchmarks conflate shortcut-solvable, noisy, and genuinely compositional queries, leading to an overestimation of model capability in multimodal composition.
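The shortcut test described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact criterion, so everything here is an assumption made for illustration: a query counts as "solved" when its target ranks in the top k by cosine similarity, a query is a unimodal shortcut when the image-only or the text-only embedding alone solves it, and a query is shortcut-free only when no model solves it unimodally. The function names (`solved_at_k`, `unimodal_shortcut_mask`, `shortcut_free`) and the toy vectors are hypothetical, and plain lists are used instead of an array library to keep the example self-contained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def solved_at_k(query, gallery, target_idx, k=1):
    """True if the target is within the top-k gallery items for this query."""
    sims = [cosine(query, g) for g in gallery]
    # Rank = number of gallery items scoring strictly higher than the target.
    rank = sum(s > sims[target_idx] for s in sims)
    return rank < k


def unimodal_shortcut_mask(image_only, text_only, gallery, targets, k=1):
    """Per query: does the image alone OR the text alone retrieve the target?"""
    return [
        solved_at_k(img, gallery, t, k) or solved_at_k(txt, gallery, t, k)
        for img, txt, t in zip(image_only, text_only, targets)
    ]


def shortcut_free(per_model_masks):
    """Assumed aggregation rule: keep a query only if no model solves it unimodally."""
    return [not any(flags) for flags in zip(*per_model_masks)]


# Toy data: a gallery of four orthogonal embeddings and three queries.
gallery = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
targets = [0, 2, 3]
image_only = [[1, 0, 0, 0], [0, 0, 0, 1], [1, 0, 0, 0]]  # query 0 solved by image
text_only = [[0, 1, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0]]   # query 1 solved by text

mask_model_a = unimodal_shortcut_mask(image_only, text_only, gallery, targets)
mask_model_b = [False, True, False]  # hypothetical mask from a second model
shortcut_rate = sum(mask_model_a) / len(mask_model_a)
keep = shortcut_free([mask_model_a, mask_model_b])
```

In this toy setup the first two queries are shortcut-solvable for the first model and only the third survives the cross-model filter. In the paper's pipeline, the surviving queries would still go through human validation before being treated as genuinely compositional.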

Problem

Research questions and friction points this paper is trying to address.

Composed Image Retrieval

multimodal composition

unimodal shortcuts

benchmark evaluation

multimodal embedding

Innovation

Methods, ideas, or system contributions that make the work stand out.

Composed Image Retrieval

multimodal composition

unimodal shortcuts