🤖 AI Summary
Existing text-to-video retrieval methods predominantly rely on single-scale CLIP architectures, limiting their capacity to model multi-granularity semantics. To address this, we propose the first Mamba-driven framework for multi-scale vision-language alignment. Our approach constructs a cross-resolution feature pyramid and incorporates linear-complexity Mamba modules to enable efficient joint modeling across scales, complemented by a lightweight cross-modal contrastive learning mechanism. Crucially, we are the first to introduce state-space models, specifically Mamba, into multi-scale cross-modal alignment, overcoming the quadratic computational bottleneck inherent in Transformer-based architectures. Extensive experiments demonstrate state-of-the-art performance on MSR-VTT, MSVD, and DiDeMo. Moreover, our method achieves a 3.2× inference speedup and a 41% reduction in parameter count, while significantly improving fine-grained semantic matching.
📝 Abstract
Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods build on large-scale pre-trained vision-language models (e.g., CLIP). However, due to the inherently plain structure of CLIP, few TVR methods explore multi-scale representations, which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale Mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid to the last single-scale feature map. We then employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks validate the superiority of MUSE.
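The pyramid-then-sequence idea in the abstract can be sketched as follows. This is a minimal, hypothetical illustration assuming 2×2 average pooling builds the pyramid from the last feature map and that each level is flattened into tokens and concatenated into one sequence for a scale-wise sequence learner (the Mamba module itself is not reproduced here); the map size and function names are assumptions, not the paper's implementation.

```python
import numpy as np

def avg_pool2x(x):
    # x: (H, W, D) with even H, W -> (H/2, W/2, D) via 2x2 mean pooling
    h, w, d = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, d).mean(axis=(1, 3))

def multi_scale_tokens(feat, num_scales=3):
    """Build a feature pyramid from one single-scale map, flatten each
    level into tokens, and concatenate into a single sequence that a
    linear-complexity sequence model could consume jointly."""
    scales, cur = [feat], feat
    for _ in range(num_scales - 1):
        cur = avg_pool2x(cur)
        scales.append(cur)
    # each level -> (H_i * W_i, D); stack along the token axis
    return np.concatenate([s.reshape(-1, s.shape[-1]) for s in scales], axis=0)

# illustrative last-layer map (8x8 spatial grid, 512-d features)
feat = np.random.rand(8, 8, 512)
tokens = multi_scale_tokens(feat)
print(tokens.shape)  # (8*8 + 4*4 + 2*2, 512) = (84, 512)
```

Because the concatenated sequence grows only geometrically with extra levels (64 + 16 + 4 tokens here), a linear-complexity learner over it stays cheap, whereas Transformer self-attention would pay a quadratic cost in the token count.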