Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers

📅 2025-11-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenge of fusing heterogeneous retrieval results in multimodal video retrieval, this paper proposes Vote-in-Context (ViC), the first framework leveraging vision-language models (VLMs) for zero-shot listwise re-ranking and fusion. ViC’s core innovation lies in serializing candidate videos, their retriever metadata, and subtitle evidence into a compact S-Grid representation embedded within the prompt—enabling VLMs to perform cross-modal joint reasoning and adaptive re-ranking without fine-tuning. The method is fully training-free, integrating prompt engineering, subtitle augmentation, and multi-source metadata modeling. Evaluated on ActivityNet, VATEX, and MSR-VTT, ViC achieves state-of-the-art performance: zero-shot Recall@1 improves by up to 40 percentage points, and it attains 99.6% accuracy on the VATEX v2t task.
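The summary's core idea, serializing candidates together with their retriever metadata and subtitle evidence into a single listwise prompt, can be sketched as a simple prompt builder. All field names (`ranks`, `subtitle`) and the prompt wording below are illustrative assumptions, not the paper's exact schema:

```python
def build_listwise_prompt(query, candidates):
    """Serialize candidate videos with retriever ranks and subtitle
    snippets into one listwise reranking prompt for a VLM.
    Field names and wording are illustrative, not ViC's exact format."""
    lines = [f"Query: {query}", "Candidates:"]
    for i, cand in enumerate(candidates, 1):
        ranks = ", ".join(f"{name}:#{pos}" for name, pos in cand["ranks"].items())
        lines.append(f"[{i}] subtitles: {cand['subtitle']} | retriever ranks: {ranks}")
    lines.append("Return the candidate indices ordered from best to worst match.")
    return "\n".join(lines)
```

In the full method each candidate entry would also carry its S-Grid image; this text-only sketch shows only how consensus signals (per-retriever ranks) and content evidence (subtitles) end up side by side in the prompt.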

📝 Abstract
In the retrieval domain, fusing candidates from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. While typical fusion techniques are training-free, they rely solely on rank or score signals, disregarding candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that re-thinks list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles, to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines like CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, gains of up to +40 Recall@1 over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC
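CombSUM, the score-fusion baseline the abstract compares against, simply sums each candidate's (normalized) scores across retrievers. A minimal sketch, assuming min-max normalization per retriever list:

```python
def combsum(result_lists):
    """CombSUM baseline: fuse ranked lists from multiple retrievers by
    summing min-max normalized scores per candidate.
    Each list is a sequence of (doc_id, raw_score) pairs."""
    fused = {}
    for results in result_lists:
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # guard against a constant-score list
        for doc_id, score in results:
            fused[doc_id] = fused.get(doc_id, 0.0) + (score - lo) / span
    # Highest fused score first
    return sorted(fused, key=fused.get, reverse=True)
```

Because CombSUM only sees ranks and scores, two retrievers that agree for different reasons count the same as two that agree for the same reason; ViC's pitch is to let a VLM inspect the candidates' content before deciding how much that consensus should count.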
Problem

Research questions and friction points this paper is trying to address.

Fusing heterogeneous retrievers for complex multi-modal data
Adaptively weighing retriever consensus against visual-linguistic content
Achieving zero-shot reranking and fusion in cross-modal video retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM serializes content evidence and retriever metadata
S-Grid compactly represents videos as image grids
Training-free framework enables zero-shot reasoning reranking
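The bullets above hinge on the S-Grid: sampled video frames tiled into one image so a single VLM input can cover a whole candidate. A rough, hypothetical analogue of that tiling (the paper's actual sampling and layout may differ):

```python
import numpy as np

def make_frame_grid(frames, cols=3):
    """Tile sampled video frames (equal-shape H x W x 3 arrays) into one
    image grid, a rough analogue of the S-Grid serialization.
    `cols` and row-major placement are assumptions for illustration."""
    h, w, c = frames[0].shape
    rows = -(-len(frames) // cols)  # ceiling division
    grid = np.zeros((rows * h, cols * w, c), dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        r, col = divmod(i, cols)  # row-major placement
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return grid
```

One such grid per candidate keeps the prompt compact: a listwise rerank over k videos costs k images rather than k full frame sequences.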
Mohamed Eltahir
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Ali Habibullah
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Lama Ayash
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Tanveer Hussain
Lecturer at Department of Computer Science, Edge Hill University
Computer Vision · Video Summarisation · Saliency Detection · Fire/Smoke Detection
Naeemullah Khan
King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia