🤖 AI Summary
Existing text-to-video retrieval methods predominantly rely on single-scale CLIP architectures, limiting their capacity to model multi-granularity semantics. To address this, we propose the first Mamba-driven framework for multi-scale vision-language alignment. Our approach constructs a cross-resolution feature pyramid and incorporates linear-complexity Mamba modules to enable efficient joint modeling across scales, complemented by a lightweight cross-modal contrastive learning mechanism. Crucially, we are the first to introduce state-space models, specifically Mamba, into multi-scale cross-modal alignment, overcoming the quadratic computational bottleneck inherent in Transformer-based architectures. Extensive experiments demonstrate state-of-the-art performance on MSR-VTT, MSVD, and DiDeMo. Moreover, our method achieves a 3.2× inference speedup and a 41% reduction in parameter count, while significantly improving fine-grained semantic matching.
📝 Abstract
Text-Video Retrieval (TVR) aims to align and associate relevant video content with corresponding natural language queries. Most existing TVR methods build on large-scale pre-trained vision-language models (e.g., CLIP). However, due to the inherently plain structure of CLIP, few TVR methods explore multi-scale representations, which offer richer contextual information for a more thorough understanding. To this end, we propose MUSE, a multi-scale Mamba with linear computational complexity for efficient cross-resolution modeling. Specifically, the multi-scale representations are generated by applying a feature pyramid to the last single-scale feature map. We then employ the Mamba structure as an efficient multi-scale learner to jointly learn scale-wise representations. Furthermore, we conduct comprehensive studies to investigate different model structures and designs. Extensive results on three popular benchmarks validate the superiority of MUSE.
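The pyramid-then-sequence idea in the abstract can be sketched as follows. This is a minimal, hypothetical illustration assuming 2×2 average pooling builds the pyramid from the last feature map and that each level is flattened into tokens and concatenated into one sequence for a scale-wise sequence learner (the Mamba module itself is not reproduced here); the map size and function names are assumptions, not the paper's implementation.

```python
import numpy as np

def avg_pool2x(x):
    # x: (H, W, D) with even H, W -> (H/2, W/2, D) via 2x2 mean pooling
    h, w, d = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, d).mean(axis=(1, 3))

def multi_scale_tokens(feat, num_scales=3):
    """Build a feature pyramid from one single-scale map, flatten each
    level into tokens, and concatenate into a single sequence that a
    linear-complexity sequence model could consume jointly."""
    scales, cur = [feat], feat
    for _ in range(num_scales - 1):
        cur = avg_pool2x(cur)
        scales.append(cur)
    # each level -> (H_i * W_i, D); stack along the token axis
    return np.concatenate([s.reshape(-1, s.shape[-1]) for s in scales], axis=0)

# illustrative last-layer map (8x8 spatial grid, 512-d features)
feat = np.random.rand(8, 8, 512)
tokens = multi_scale_tokens(feat)
print(tokens.shape)  # (8*8 + 4*4 + 2*2, 512) = (84, 512)
```

Because the concatenated sequence grows only geometrically with extra levels (64 + 16 + 4 tokens here), a linear-complexity learner over it stays cheap, whereas Transformer self-attention would pay a quadratic cost in the token count.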