UniRank: End-to-End Domain-Specific Reranking of Hybrid Text-Image Candidates

📅 2026-02-08

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses multimodal reranking, where the modality gap biases relevance scores and existing open-source models underperform in specialized domains. The authors propose UniRank, a VLM-based framework that reranks hybrid text-image candidate sets end to end, without converting candidates from one modality to another. UniRank first uses instruction tuning to learn calibrated cross-modal relevance scoring by mapping label-token likelihoods to a unified scalar score. It then aligns its ranking policy on pairwise preferences built from in-domain hard negatives, using query-level policy optimization through reinforcement learning from human feedback (RLHF). Together, the two stages reduce the modality gap and adapt the reranker to the target domain. On scientific literature retrieval and design patent search, UniRank outperforms state-of-the-art baselines, improving Recall@1 by 8.9% and 7.3%, respectively.

📝 Abstract

Reranking is a critical component in many information retrieval pipelines. Despite remarkable progress in text-only settings, multimodal reranking remains challenging, particularly when the candidate set contains hybrid text and image items. A key difficulty is the modality gap: a text reranker is intrinsically closer to text candidates than to image candidates, leading to biased and suboptimal cross-modal ranking. Vision-language models (VLMs) mitigate this gap through strong cross-modal alignment and have recently been adopted to build multimodal rerankers. However, most VLM-based rerankers encode all candidates as images, and treating text as images introduces substantial computational overhead. Meanwhile, existing open-source multimodal rerankers are typically trained on general-domain data and often underperform in domain-specific scenarios. To address these limitations, we propose UniRank, a VLM-based reranking framework that natively scores and orders hybrid text-image candidates without any modality conversion. Building on this hybrid scoring interface, UniRank provides an end-to-end domain adaptation pipeline that includes: (1) an instruction-tuning stage that learns calibrated cross-modal relevance scoring by mapping label-token likelihoods to a unified scalar score; and (2) a hard-negative-driven preference alignment stage that constructs in-domain pairwise preferences and performs query-level policy optimization through reinforcement learning from human feedback (RLHF). Extensive experiments on scientific literature retrieval and design patent search demonstrate that UniRank consistently outperforms state-of-the-art baselines, improving Recall@1 by 8.9% and 7.3%, respectively.
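The idea of mapping label-token likelihoods to a unified scalar score can be illustrated with a minimal sketch. The abstract does not specify the label tokens or the exact mapping, so everything below is an assumption made for illustration: the labels "yes" and "no", the softmax over their logits, and the names `relevance_score` and `rerank` are hypothetical and not taken from the paper. The logits stand in for whatever a VLM would emit for each candidate, text or image.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical candidate record: (candidate id, modality, label-token logits).
Candidate = Tuple[str, str, Dict[str, float]]


def relevance_score(label_logits: Dict[str, float], positive: str = "yes") -> float:
    """Turn label-token logits into one scalar in [0, 1].

    Assumption: the score is the softmax probability of the positive label
    over the label tokens only. The paper may use a different mapping.
    """
    peak = max(label_logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(val - peak) for tok, val in label_logits.items()}
    return exps[positive] / sum(exps.values())


def rerank(candidates: List[Candidate]) -> List[Tuple[str, str, float]]:
    """Score text and image candidates on the same scale, then sort descending."""
    scored = [(cid, modality, relevance_score(logits))
              for cid, modality, logits in candidates]
    return sorted(scored, key=lambda item: item[2], reverse=True)


if __name__ == "__main__":
    # Made-up logits for a mixed candidate set; not real model output.
    pool: List[Candidate] = [
        ("doc-1", "text", {"yes": 0.2, "no": 1.1}),
        ("img-7", "image", {"yes": 2.3, "no": -0.4}),
        ("doc-4", "text", {"yes": 1.0, "no": 1.0}),
    ]
    for cid, modality, score in rerank(pool):
        print(f"{cid:6s} {modality:5s} {score:.3f}")
```

Because every candidate is reduced to the same kind of scalar regardless of modality, text and image items can be sorted in a single list, which is the property the hybrid scoring interface relies on.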

Problem

Research questions and friction points this paper is trying to address.

multimodal reranking

modality gap

domain-specific retrieval

hybrid text-image candidates

cross-modal ranking

Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal reranking

vision-language models

domain adaptation