MUSE: Mamba is Efficient Multi-scale Learner for Text-video Retrieval

📅 2024-08-20
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing text-to-video retrieval methods predominantly rely on single-scale CLIP architectures, limiting their capacity to model multi-granularity semantics. To address this, we propose the first Mamba-driven framework for multi-scale vision-language alignment. Our approach constructs a cross-resolution feature pyramid and incorporates linear-complexity Mamba modules to enable efficient joint modeling across scales, complemented by a lightweight cross-modal contrastive learning mechanism. Crucially, we are the first to introduce state-space models, specifically Mamba, into multi-scale cross-modal alignment, overcoming the quadratic computational bottleneck inherent in Transformer-based architectures. Extensive experiments demonstrate state-of-the-art performance on MSR-VTT, MSVD, and DiDeMo. Moreover, our method achieves a 3.2× inference speedup and a 41% reduction in parameter count while significantly enhancing fine-grained semantic matching.

📝 Abstract
Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods are based on large-scale pre-trained vision-language models (e.g., CLIP). However, due to the inherent plain structure of CLIP, few TVR methods explore multi-scale representations, which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale Mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid on the last single-scale feature map. Then, we employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks have validated the superiority of MUSE.
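The abstract's pipeline (feature pyramid over the last single-scale feature map, then joint scale-wise sequence modeling) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the average-pooling operator, scale factors, and feature dimensions are assumptions for the sketch.

```python
import numpy as np

def avg_pool2d(x, k):
    """Non-overlapping k x k average pooling over an (H, W, C) feature map."""
    H, W, C = x.shape
    Hk, Wk = H // k, W // k
    x = x[:Hk * k, :Wk * k].reshape(Hk, k, Wk, k, C)
    return x.mean(axis=(1, 3))

def build_pyramid(feat, scales=(1, 2, 4)):
    # Multi-scale pyramid derived from the last single-scale feature map,
    # as the abstract describes; pooling as the downsampler is an assumption.
    return [avg_pool2d(feat, s) for s in scales]

def flatten_scales(pyramid):
    # Concatenate scale-wise tokens into one sequence so a linear-complexity
    # sequence model (e.g. Mamba) can jointly learn across scales.
    return np.concatenate([p.reshape(-1, p.shape[-1]) for p in pyramid], axis=0)

feat = np.random.randn(14, 14, 512)   # e.g. a CLIP ViT patch grid (assumed size)
pyr = build_pyramid(feat)
tokens = flatten_scales(pyr)
print([p.shape for p in pyr])  # [(14, 14, 512), (7, 7, 512), (3, 3, 512)]
print(tokens.shape)            # (254, 512) = 14*14 + 7*7 + 3*3 tokens
```

Because all scales are flattened into a single token sequence, the cost of the subsequent sequence model grows linearly with the total token count rather than quadratically, which is the motivation for using Mamba here.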
Problem

Research questions and friction points this paper is trying to address.

Enhances text-video retrieval accuracy
Utilizes multi-scale representations efficiently
Reduces computational complexity in modeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-scale Mamba for TVR
Linear computational complexity
Feature pyramid for contextual information
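The "linear computational complexity" point comes from the state-space recurrence underlying Mamba: a single left-to-right scan over the sequence. A toy, non-selective diagonal scan (a simplification; real Mamba uses input-dependent, discretized parameters) makes the O(L) cost visible:

```python
import numpy as np

def ssm_scan(u, A, B, C):
    # Diagonal state-space recurrence over a length-L scalar sequence:
    #   h_t = A * h_{t-1} + B * u_t,   y_t = <C, h_t>
    # One pass, O(L) time with O(N) state, in contrast to
    # self-attention's O(L^2) pairwise interactions.
    h = np.zeros(A.shape[0])
    y = np.empty(len(u))
    for t, u_t in enumerate(u):
        h = A * h + B * u_t
        y[t] = C @ h
    return y

rng = np.random.default_rng(0)
L, N = 256, 16
A = np.full(N, 0.9)                 # stable diagonal transition (assumed value)
B, C = rng.standard_normal((2, N))
y = ssm_scan(rng.standard_normal(L), A, B, C)
print(y.shape)  # (256,)
```

Doubling L doubles the work of the scan, whereas a Transformer layer's attention cost would quadruple; this is the efficiency argument the Innovation bullets refer to.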
Haoran Tang
School of Electronic and Computer Engineering, Peking University; Peng Cheng Laboratory
Meng Cao
Postdoc, Carnegie Mellon University
Psychology
Jinfa Huang
University of Rochester, Peking University
Vision and Language, Reasoning Models, Generative Models, Computer Vision
Ruyang Liu
School of Electronic and Computer Engineering, Peking University; Peng Cheng Laboratory
Peng Jin
School of Electronic and Computer Engineering, Peking University; Peng Cheng Laboratory
Ge Li
Full Professor of Computer Science, Peking University
Program Analysis, Program Generation, Deep Learning
Xiaodan Liang
Professor of Computer Science, Sun Yat-sen University, MBZUAI, CMU, NUS
Computer vision, Embodied AI, Machine learning