Generative Recall, Dense Reranking: Learning Multi-View Semantic IDs for Efficient Text-to-Video Retrieval

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge in text-to-video retrieval where conventional two-stage approaches struggle to balance recall efficiency and accuracy due to semantic ambiguity and cross-modal misalignment. The authors propose the GRDR framework, which introduces a multi-view semantic ID mechanism during the recall stage. By jointly training a query-guided tokenizer with a shared codebook, GRDR generates diverse semantic pathways and integrates Trie-constrained decoding with dense re-ranking to achieve both high efficiency and precision. Notably, this is the first approach to incorporate generative retrieval as a high-quality recall module within a two-stage architecture. On mainstream TVR benchmarks, GRDR matches the accuracy of strong dense retrievers while reducing index storage by an order of magnitude and accelerating full-database retrieval by up to 300×.
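The summary notes that semantic IDs capture high-level video features via quantization. As a rough illustration only (the paper's tokenizer is query-guided and jointly trained; the residual-quantization setup, function names, and three-level/8-code configuration below are assumptions for the sketch, not the authors' method), a codebook-based tokenizer can map a continuous video embedding to a discrete semantic ID like this:

```python
import numpy as np

def quantize_to_semantic_id(vec, codebooks):
    """Map a continuous embedding to a discrete semantic ID via
    residual quantization: at each level, pick the nearest codeword,
    then quantize what remains of the residual at the next level."""
    semantic_id = []
    residual = vec.astype(np.float64)
    for codebook in codebooks:  # each codebook: (K, d) array of codewords
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(np.argmin(dists))       # nearest codeword index
        semantic_id.append(idx)
        residual = residual - codebook[idx]
    return tuple(semantic_id)

# Toy setup: 3 quantization levels, 8 codewords each, 4-dim embeddings.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(8, 4)) for _ in range(3)]
video_emb = rng.normal(size=4)
sid = quantize_to_semantic_id(video_emb, codebooks)
```

Each video thus collapses to a short token sequence; GRDR's multi-view mechanism extends this by assigning several such IDs per video rather than one.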

📝 Abstract
Text-to-Video Retrieval (TVR) is essential to video platforms. Dense retrieval with dual-modality encoders leads in accuracy, but its computation and storage scale poorly with corpus size. Real-time large-scale applications therefore adopt two-stage retrieval, where a fast recall model gathers a small candidate pool that an advanced dense retriever then reranks. Because the candidate pool is drastically reduced, the reranking stage can use any off-the-shelf dense retriever without hurting efficiency, so the recall model bounds two-stage TVR performance. Recently, generative retrieval (GR) has replaced dense video embeddings with discrete semantic IDs, retrieving by decoding text queries into ID tokens. GR offers near-constant inference and storage complexity, and its semantic IDs capture high-level video features via quantization, making it well suited to quickly eliminating irrelevant candidates during recall. However, as a recall model in two-stage TVR, GR suffers from (i) semantic ambiguity, since each video satisfies diverse queries yet is forced into a single semantic ID; and (ii) cross-modal misalignment, since semantic IDs are derived solely from visual features without text supervision. We propose Generative Recall and Dense Reranking (GRDR), a novel GR method that improves the quality of recalled candidates. GRDR assigns multiple semantic IDs to each video using a query-guided multi-view tokenizer that exposes diverse semantic access paths, and jointly trains the tokenizer and generative retriever via a shared codebook so that semantic IDs act as a semantic bridge between texts and videos. At inference, trie-constrained decoding generates a compact candidate set that a dense model reranks for fine-grained matching. Experiments on TVR benchmarks show that GRDR matches strong dense retrievers in accuracy while reducing index storage by an order of magnitude and accelerating full-corpus retrieval by up to 300$\times$.
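The abstract's trie-constrained decoding step can be sketched as follows. This is a minimal illustration, not the paper's implementation: real GR systems use beam search over a trained decoder's token scores, whereas here `score_fn` is a hypothetical stand-in and decoding is greedy for brevity. The trie guarantees that only token sequences corresponding to indexed videos can be generated.

```python
def build_trie(semantic_ids):
    """Build a prefix trie over the valid semantic IDs, so decoding
    can never produce a sequence that matches no indexed video."""
    trie = {}
    for sid in semantic_ids:
        node = trie
        for tok in sid:
            node = node.setdefault(tok, {})
    return trie

def constrained_decode(score_fn, trie, depth):
    """Greedy decoding restricted to trie-valid continuations.
    score_fn(prefix, token) stands in for the generative retriever's
    token scores (a hypothetical interface for this sketch)."""
    prefix, node = [], trie
    for _ in range(depth):
        allowed = list(node.keys())  # only children of the current prefix
        tok = max(allowed, key=lambda t: score_fn(tuple(prefix), t))
        prefix.append(tok)
        node = node[tok]
    return tuple(prefix)

# Toy index of three semantic IDs, and a toy scorer preferring larger tokens.
ids = [(1, 4, 2), (1, 4, 7), (3, 0, 5)]
trie = build_trie(ids)
result = constrained_decode(lambda p, t: t, trie, depth=3)
```

In the full pipeline, the decoded IDs map back to a compact set of candidate videos, which the dense reranker then scores for fine-grained matching.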
Problem

Research questions and friction points this paper is trying to address.

Text-to-Video Retrieval
Generative Retrieval
Semantic Ambiguity
Cross-modal Misalignment
Two-stage Retrieval
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Retrieval
Multi-View Semantic IDs
Text-to-Video Retrieval
Dense Reranking
Cross-Modal Alignment