ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval

📅 2026-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the marked degradation of fine-grained reasoning in generative multimodal large language models (MLLMs) when they are adapted directly into embedding-based retrievers for Composed Image Retrieval, a limitation the authors trace to a fundamental paradigm mismatch between generation and discrimination. To bridge this gap, they propose ReCALL, a model-agnostic calibration framework that realigns the discriminative embedding space with MLLM reasoning through a three-stage diagnose-generate-refine pipeline. ReCALL integrates self-guided sample mining, chain-of-thought prompting to generate corrective instructions and triplets, VQA-based consistency filtering, and grouped contrastive learning. The authors present this as the first systematic treatment of MLLM capability degradation in retrieval tasks, reporting state-of-the-art performance on the CIRR and FashionIQ benchmarks and substantial gains in compositional retrieval accuracy.

📝 Abstract
Composed Image Retrieval (CIR) aims to retrieve target images based on a hybrid query comprising a reference image and a modification text. Early dual-tower Vision-Language Models (VLMs) struggle with the cross-modality compositional reasoning this task requires. Recently, adapting generative Multimodal Large Language Models (MLLMs) for retrieval has emerged as a promising direction. However, we identify that this adaptation strategy overlooks a fundamental issue: converting a generative MLLM into a single-embedding discriminative retriever triggers a paradigm conflict, leading to Capability Degradation, i.e., the deterioration of native fine-grained reasoning after retrieval adaptation. To address this challenge, we propose ReCALL (Recalibrating Capability Degradation), a model-agnostic framework that follows a diagnose-generate-refine pipeline. First, we diagnose the retriever's cognitive blind spots via self-guided informative instance mining. Next, we generate corrective instructions and triplets by CoT-prompting the foundation MLLM, with VQA-based consistency filtering for quality control. Finally, we refine the retriever through continual training on these triplets with a grouped contrastive scheme, thereby internalizing fine-grained visual-semantic distinctions and realigning the retriever's discriminative embedding space with the MLLM's intrinsic compositional reasoning. Extensive experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance. Code will be released soon.
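The abstract does not spell out the grouped contrastive objective; the sketch below is a minimal illustration, assuming an InfoNCE-style loss in which each composed-query embedding contrasts its target image against both the other in-batch targets and a per-query group of mined hard negatives. All function names, shapes, and the temperature value are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def grouped_contrastive_loss(query, target, group_negs, tau=0.07):
    """Illustrative grouped contrastive loss (not the paper's exact scheme).

    query:      (B, D) composed-query embeddings (reference image + text)
    target:     (B, D) target-image embeddings
    group_negs: (B, K, D) hard-negative embeddings mined per query
    tau:        temperature for the softmax over similarities
    """
    def norm(x):
        # L2-normalize along the last axis so dot products are cosine similarities
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    q, t, g = norm(query), norm(target), norm(group_negs)
    in_batch = q @ t.T / tau                              # (B, B): target i is the positive for query i
    grouped = np.einsum('bd,bkd->bk', q, g) / tau         # (B, K): per-query hard-negative logits
    logits = np.concatenate([in_batch, grouped], axis=1)  # (B, B+K)

    # Numerically stable log-softmax; the positive for query i sits at column i
    logits -= logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return -np.mean(log_prob[idx, idx])
```

The design point the abstract emphasizes is the grouping: rather than relying only on random in-batch negatives, each query carries its own group of mined hard negatives drawn from the diagnosed blind spots, so the gradient focuses on exactly the fine-grained distinctions the retriever got wrong.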
Problem

Research questions and friction points this paper is trying to address.

Composed Image Retrieval
Multimodal Large Language Models
Capability Degradation
Retriever Adaptation
Cross-modality Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Capability Degradation
Multimodal Large Language Models
Composed Image Retrieval
Contrastive Learning
Chain-of-Thought Prompting