UniCVR: From Alignment to Reranking for Unified Zero-Shot Composed Visual Retrieval

📅 2026-04-22
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the lack of a unified zero-shot solution for composed visual retrieval tasks, which have traditionally been tackled in isolation: composed image retrieval, multi-turn composed image retrieval, and composed video retrieval. The authors propose UniCVR, the first unified zero-shot framework for these tasks, which integrates multimodal large language models (MLLMs) with vision-language pre-trained (VLP) models through a two-stage pipeline. The first stage trains the MLLM as a query embedder aligned with the embedding space of the frozen VLP gallery encoder; the second applies an MLLM-guided dual-level adaptive reranking mechanism to the top-ranked candidates. Key innovations include a cluster-based hard negative sampling strategy and a budget-constrained scoring design that keeps reranking efficient. Evaluated on five benchmarks spanning all three tasks, UniCVR consistently achieves state-of-the-art performance, demonstrating its effectiveness and strong generalization capability.

📝 Abstract
Composed image retrieval, multi-turn composed image retrieval, and composed video retrieval all share a common paradigm: composing the reference visual with modification text to retrieve the desired target. Despite this shared structure, the three tasks have been studied in isolation, with no prior work proposing a unified framework, let alone a zero-shot solution. In this paper, we propose UniCVR, the first unified zero-shot composed visual retrieval framework that jointly addresses all three tasks without any task-specific human-annotated data. UniCVR strategically combines two complementary strengths: Multimodal Large Language Models (MLLMs) for compositional query understanding and Vision-Language Pre-trained (VLP) models for structured visual retrieval. Concretely, UniCVR operates in two stages. In Stage I, we train the MLLM as a compositional query embedder via contrastive learning on a curated multi-source dataset of approximately 3.5M samples, bridging the heterogeneous embedding spaces between the MLLM and the frozen VLP gallery encoder. A cluster-based hard negative sampling strategy is proposed to strengthen contrastive supervision. In Stage II, we introduce an MLLM-guided dual-level reranking mechanism that applies adaptive budgeted subset scoring to a small number of top-ranked candidates, and then exploits the resulting relevance signals through a dual-level re-scoring scheme, producing more accurate final rankings with minimal computational overhead. Extensive experiments across five benchmarks covering all three tasks demonstrate that UniCVR achieves cutting-edge performance, validating its effectiveness and generalizability. Our data and code will be released upon acceptance.
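The two stages described in the abstract can be illustrated with a small, generic sketch. It is not the authors' implementation: the abstract does not specify the loss, the clustering procedure, or the dual-level re-scoring, so everything below is an assumption. `info_nce` is a standard InfoNCE contrastive loss standing in for the Stage I objective, `same_cluster_negatives` is one plausible reading of cluster-based hard negative sampling, and `retrieve_then_rerank` is a plain top-k rerank with linear score fusion standing in for the Stage II mechanism. All function names, the `temperature`, `top_k`, and `weight` parameters, and the toy `scorer` are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def info_nce(query, positive, negatives, temperature=0.07):
    """Standard InfoNCE loss for one query (assumed Stage I objective)."""
    q = normalize(query)
    candidates = [positive] + list(negatives)
    logits = [dot(q, normalize(c)) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denom - logits[0]


def same_cluster_negatives(target_id, cluster_of):
    """Assumed reading of cluster-based hard negatives:
    other items that share the target's cluster."""
    cluster = cluster_of[target_id]
    return [i for i, c in cluster_of.items() if c == cluster and i != target_id]


def retrieve_then_rerank(query, gallery, scorer, top_k=3, weight=0.5):
    """Generic top-k rerank (assumed stand-in for Stage II).

    Only the top_k first-stage candidates are passed to the costly scorer;
    the remaining candidates keep their first-stage order.
    """
    q = normalize(query)
    sims = {name: dot(q, normalize(emb)) for name, emb in gallery.items()}
    ranked = sorted(sims, key=sims.get, reverse=True)
    head, tail = ranked[:top_k], ranked[top_k:]
    fused = {n: (1 - weight) * sims[n] + weight * scorer(n) for n in head}
    return sorted(head, key=fused.get, reverse=True) + tail


if __name__ == "__main__":
    gallery = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
    # Toy scorer standing in for an MLLM relevance judgment.
    print(retrieve_then_rerank([1.0, 0.0], gallery,
                               lambda n: 1.0 if n == "b" else 0.0, top_k=2))
```

In this sketch the expensive scorer is called only `top_k` times per query, which is the general idea behind restricting reranking to a small number of top-ranked candidates; how UniCVR sets that budget adaptively is not described in the abstract.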
Problem

Research questions and friction points this paper is trying to address.

composed visual retrieval
zero-shot learning
unified framework
multimodal retrieval
vision-language
Innovation

Methods, ideas, or system contributions that make the work stand out.

zero-shot
unified framework
compositional retrieval
multimodal large language model
dual-level reranking