🤖 AI Summary
To address the trade-off between low accuracy in two-tower architectures and poor efficiency in single-tower architectures for text-to-video retrieval (T2VR), this paper proposes Hybrid-Tower: a novel framework that augments the two-tower structure with a fine-grained pseudo-query generation mechanism. This enables videos to undergo implicit, cross-modal pre-interaction with synthetically generated pseudo-queries—even in the absence of real textual input—thereby achieving fine-grained semantic alignment prior to retrieval. The design combines the high accuracy of single-tower models with the computational efficiency of two-tower ones. Built upon the CLIP architecture, Hybrid-Tower integrates pseudo-query generation, fine-grained cross-modal interaction, and two-tower inference. Evaluated on five benchmark datasets, it achieves near state-of-the-art performance, improving R@1 by 1.6%–3.9% while maintaining inference speed comparable to standard two-tower models. Its core innovation lies in modeling intrinsic video semantics without requiring ground-truth queries and enabling efficient, precise matching via query-free, pseudo-query–driven alignment.
📝 Abstract
The Text-to-Video Retrieval (T2VR) task aims to retrieve unlabeled videos using textual queries with the same semantic meaning. Recent CLIP-based approaches have explored two frameworks, Two-Tower versus Single-Tower, yet the former suffers from low effectiveness while the latter suffers from low efficiency. In this study, we explore a new Hybrid-Tower framework that hybridizes the advantages of the Two-Tower and Single-Tower frameworks, achieving high effectiveness and efficiency simultaneously. We propose a novel hybrid method, Fine-grained Pseudo-query Interaction and Generation for T2VR, i.e., PIG, which includes a new pseudo-query generator designed to generate a pseudo-query for each video. This enables the video feature and the textual features of the pseudo-query to interact in a fine-grained manner, as in Single-Tower approaches, thereby retaining high effectiveness even before the real textual query is received. Simultaneously, our method introduces no additional storage or computational overhead compared to the Two-Tower framework during the inference stage, thus maintaining high efficiency. Extensive experiments on five commonly used text-video retrieval benchmarks demonstrate that our method achieves a significant improvement over the baseline, with an increase of $1.6\% \sim 3.9\%$ in R@1. Furthermore, our method matches the efficiency of Two-Tower models while achieving near state-of-the-art performance, highlighting the advantages of the Hybrid-Tower framework.
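To make the two-stage idea concrete, below is a minimal, purely illustrative sketch of the Hybrid-Tower flow described above: offline, a pseudo-query generator produces token-level pseudo-query features from a video so that fine-grained (Single-Tower-style) interaction can happen before any real query arrives; online, retrieval is a plain Two-Tower similarity between a query embedding and a precomputed video embedding. All class names, shapes, and the max-over-frames scoring rule are assumptions for illustration, not the authors' actual implementation.

```python
# Illustrative sketch of the Hybrid-Tower framework (not the paper's code).
# Offline: pseudo-query tokens pre-interact with frame features.
# Online: standard Two-Tower cosine similarity, no extra inference cost.
import math
import random

random.seed(0)
DIM = 8  # toy feature dimension

def rand_vec(dim=DIM):
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    na = math.sqrt(dot(a, a)) or 1.0
    nb = math.sqrt(dot(b, b)) or 1.0
    return dot(a, b) / (na * nb)

class PseudoQueryGenerator:
    """Hypothetical stand-in for a learned generator that maps a video
    embedding to a few pseudo-query token embeddings."""
    def __init__(self, n_tokens=3):
        # one random linear projection per pseudo token (illustrative only)
        self.proj = [[rand_vec() for _ in range(DIM)] for _ in range(n_tokens)]

    def __call__(self, video_emb):
        return [[dot(row, video_emb) for row in p] for p in self.proj]

def fine_grained_interaction(frames, pseudo_tokens):
    """Single-Tower-style scoring: each pseudo token matches its best
    frame; scores are averaged. Runs offline, before real queries exist."""
    return sum(
        max(cosine(tok, f) for f in frames) for tok in pseudo_tokens
    ) / len(pseudo_tokens)

# --- Offline (indexing) stage ---
frames = [rand_vec() for _ in range(4)]            # per-frame features
video_emb = [sum(f[i] for f in frames) / len(frames) for i in range(DIM)]
pseudo_q = PseudoQueryGenerator()(video_emb)
pre_score = fine_grained_interaction(frames, pseudo_q)  # guides enhancement

# --- Online (inference) stage: pure Two-Tower matching ---
text_emb = rand_vec()                              # real query feature
retrieval_score = cosine(text_emb, video_emb)      # one similarity, fast
print(pre_score, retrieval_score)
```

The key property the sketch captures is that `fine_grained_interaction` only runs during offline indexing/training, so online retrieval stays a single embedding comparison, matching Two-Tower efficiency.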