Hybrid-Tower: Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the trade-off between low accuracy in dual-tower architectures and poor efficiency in single-tower architectures for text-to-video retrieval (T2VR), this paper proposes Hybrid-Tower: a novel framework that augments the dual-tower structure with a fine-grained pseudo-query generation mechanism. This enables videos to undergo implicit, cross-modal pre-interaction with synthetically generated pseudo-queries—even in the absence of real textual input—thereby achieving fine-grained semantic alignment prior to retrieval. The design harmonizes the high accuracy of single-tower models with the computational efficiency of dual-tower ones. Built upon the CLIP architecture, Hybrid-Tower integrates pseudo-query generation, fine-grained cross-modal interaction, and dual-tower inference. Evaluated on five benchmark datasets, it achieves state-of-the-art–level performance, improving R@1 by 1.6–3.9% while maintaining inference speed comparable to standard dual-tower models. Its core innovation lies in modeling intrinsic video semantics without requiring ground-truth queries and enabling efficient, precise matching via query-free, pseudo-query–driven alignment.
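The key efficiency argument above can be sketched in a few lines: all fine-grained pseudo-query interaction happens offline at indexing time, so the online query path is a plain dual-tower dot product. This is a toy illustration only; `generate_pseudo_query` and `fine_grained_interact` are hypothetical stand-ins (random-noise placeholders), not the paper's actual CLIP-based modules.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension

def l2norm(x):
    # Normalize vectors to unit length so dot product = cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# --- Offline (indexing time): hypothetical stand-ins for the paper's modules ---
def generate_pseudo_query(video_feat):
    # Placeholder: a real system would decode pseudo-query text features
    # from the video itself; here we just perturb the video feature.
    return l2norm(video_feat + 0.1 * rng.standard_normal(video_feat.shape))

def fine_grained_interact(video_feat, pseudo_query_feat):
    # Placeholder cross-modal fusion: blend video and pseudo-query features
    # into a single enriched video embedding.
    return l2norm(0.7 * video_feat + 0.3 * pseudo_query_feat)

# Index three toy videos; the interaction cost is paid entirely offline.
videos = l2norm(rng.standard_normal((3, D)))
index = np.stack([fine_grained_interact(v, generate_pseudo_query(v)) for v in videos])

# --- Online (query time): standard dual-tower scoring, one matrix product ---
query = l2norm(videos[1] + 0.05 * rng.standard_normal(D))  # query near video 1
scores = index @ query
best = int(np.argmax(scores))
```

Because the index stores one enriched embedding per video, the online storage and compute match a vanilla dual-tower system, which is the efficiency claim the summary makes.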

📝 Abstract
The Text-to-Video Retrieval (T2VR) task aims to retrieve unlabeled videos by textual queries with the same semantic meanings. Recent CLIP-based approaches have explored two frameworks, Two-Tower versus Single-Tower, yet the former suffers from low effectiveness, while the latter suffers from low efficiency. In this study, we explore a new Hybrid-Tower framework that can hybridize the advantages of the Two-Tower and Single-Tower frameworks, achieving high effectiveness and efficiency simultaneously. We propose a novel hybrid method, Fine-grained Pseudo-query Interaction and Generation for T2VR, i.e., PIG, which includes a new pseudo-query generator designed to generate a pseudo-query for each video. This enables the video feature and the textual features of the pseudo-query to interact in a fine-grained manner, similar to Single-Tower approaches, thereby retaining high effectiveness even before the real textual query is received. Simultaneously, our method introduces no additional storage or computational overhead compared to the Two-Tower framework during the inference stage, thus maintaining high efficiency. Extensive experiments on five commonly used text-video retrieval benchmarks demonstrate that our method achieves a significant improvement over the baseline, with an increase of 1.6%–3.9% in R@1. Furthermore, our method matches the efficiency of Two-Tower models while achieving near state-of-the-art performance, highlighting the advantages of the Hybrid-Tower framework.
Problem

Research questions and friction points this paper is trying to address.

Hybrid framework for text-video retrieval effectiveness and efficiency
Fine-grained pseudo-query interaction to enhance video-text matching
Balancing retrieval performance with computational overhead reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid-Tower framework combining two-tower and single-tower advantages
Pseudo-query generator enabling fine-grained video-text feature interaction
Maintains high efficiency with no extra inference overhead
Authors
Bangxiang Lan (Renmin University of China)
Ruobing Xie (Tencent)
Ruixiang Zhao (Renmin University of China)
Xingwu Sun (Tencent)
Zhanhui Kang (Large Language Model Department, Tencent)
Gang Yang (Renmin University of China)
Xirong Li (Renmin University of China)