UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates

Date: 2026-02-08
AI Summary
This work addresses two problems in multimodal reranking: the modality gap, which biases a reranker's scores toward text candidates over image candidates, and the weak performance of existing open-source multimodal rerankers in specialized domains. The authors propose UniRank, a VLM-based framework that reranks mixed text-image candidate sets end to end, without converting text to images. Its domain adaptation pipeline has two stages. An instruction-tuning stage calibrates cross-modal relevance by mapping label-token likelihoods to a unified scalar score. A preference alignment stage then builds in-domain pairwise preferences from hard negatives and optimizes the ranking policy at the query level with reinforcement learning from human feedback (RLHF). On scientific literature retrieval and design patent search, UniRank outperforms state-of-the-art baselines, improving Recall@1 by 8.9% and 7.3%, respectively.
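The summary does not say how the hard negatives are mined. One common recipe, shown here as a minimal sketch rather than the paper's procedure, is to take the highest-scoring non-relevant candidates from a first-stage retriever and pair each with a known relevant candidate. The function name `build_preference_pairs`, the `num_negatives` parameter, and the candidate IDs are all hypothetical.

```python
def build_preference_pairs(query_id, retrieved, relevant_ids, num_negatives=2):
    """Form (query, preferred, rejected) triples from first-stage results.

    retrieved: list of (candidate_id, first_stage_score) tuples; candidates
    may be text or image items. This mining rule is an assumption, not
    something the source specifies.
    """
    # Sort by first-stage score so the most confusable items come first.
    ranked = sorted(retrieved, key=lambda item: item[1], reverse=True)
    positives = [cid for cid, _ in ranked if cid in relevant_ids]
    # Hard negatives: top-scoring candidates that are NOT relevant.
    hard_negatives = [cid for cid, _ in ranked if cid not in relevant_ids]
    hard_negatives = hard_negatives[:num_negatives]
    return [(query_id, pos, neg) for pos in positives for neg in hard_negatives]


# Hypothetical hybrid candidate list for one query.
retrieved = [("img-3", 0.91), ("txt-8", 0.88), ("txt-1", 0.75), ("img-9", 0.40)]
pairs = build_preference_pairs("q1", retrieved, relevant_ids={"txt-1"})
print(pairs)
```

Each triple says that, for the given query, the first candidate should rank above the second. Pairs of this form are the usual input to a preference-based optimization stage.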
πŸ“ Abstract
Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal reranking remains challenging, particularly when the candidate set contains hybrid text and image items. A key difficulty is the modality gap: a text reranker is intrinsically closer to text candidates than to image candidates, leading to biased and suboptimal cross-modal ranking. Vision-language models (VLMs) mitigate this gap through strong cross-modal alignment and have recently been adopted to build multimodal rerankers. However, most VLM-based rerankers encode all candidates as images, and treating text as images introduces substantial computational overhead. Meanwhile, existing open-source multimodal rerankers are typically trained on general-domain data and often underperform in domain-specific scenarios. To address these limitations, we propose UniRank, a VLM-based reranking framework that natively scores and orders hybrid text-image candidates without any modality conversion. Building on this hybrid scoring interface, UniRank provides an end-to-end domain adaptation pipeline that includes: (1) an instruction-tuning stage that learns calibrated cross-modal relevance scoring by mapping label-token likelihoods to a unified scalar score; and (2) a hard-negative-driven preference alignment stage that constructs in-domain pairwise preferences and performs query-level policy optimization through reinforcement learning from human feedback (RLHF). Extensive experiments on scientific literature retrieval and design patent search demonstrate that UniRank consistently outperforms state-of-the-art baselines, improving Recall@1 by 8.9% and 7.3%, respectively.
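The abstract does not give the formula that maps label-token likelihoods to a unified scalar score. The sketch below illustrates one plausible reading: apply a softmax over the logits of the relevance label tokens and use the probability of the positive label as the score. The "yes"/"no" label tokens, the function names, and the example logits are assumptions for illustration, and the logits stand in for what a VLM would produce for each candidate.

```python
import math


def relevance_score(label_logits, positive="yes"):
    """Softmax over label-token logits; return the positive label's probability.

    label_logits: dict mapping each label token to its logit. The label set
    ("yes"/"no") is an assumed convention, not taken from the source.
    """
    peak = max(label_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(logit - peak) for tok, logit in label_logits.items()}
    return exps[positive] / sum(exps.values())


def rerank(candidates):
    """Score text and image candidates on one scale and sort descending.

    candidates: list of (candidate_id, modality, label_logits) tuples.
    """
    scored = [(cid, mod, relevance_score(logits)) for cid, mod, logits in candidates]
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Hypothetical logits for a hybrid candidate set.
candidates = [
    ("doc-text-1", "text", {"yes": 2.0, "no": 0.0}),
    ("fig-image-7", "image", {"yes": 3.0, "no": -1.0}),
    ("doc-text-2", "text", {"yes": -1.0, "no": 1.0}),
]
ranking = rerank(candidates)
for cid, modality, score in ranking:
    print(f"{cid:12s} {modality:5s} {score:.3f}")
```

Because every candidate, whatever its modality, receives a probability in (0, 1) from the same label tokens, text and image items can be sorted in a single list without converting one modality into the other.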
Problem

Research questions and friction points this paper is trying to address.

multimodal reranking
modality gap
domain-specific retrieval
hybrid text-image candidates
cross-modal ranking
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal reranking
vision-language models
domain adaptation
reinforcement learning from human feedback
hybrid text-image retrieval