MamFusion: Multi-Mamba with Temporal Fusion for Partially Relevant Video Retrieval

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the challenges of information redundancy in long videos and difficulty in fine-grained temporal modeling in Partially Relevant Video Retrieval (PRVR), this paper proposes the first cross-modal retrieval framework built upon the Mamba state-space model. Methodologically, we design multiple Mamba modules to capture long-range video dynamics and introduce an explicit bidirectional temporal fusion mechanism—text-to-video and video-to-text—to jointly model cross-modal semantic evolution. Our approach further integrates multi-scale temporal encoding, cross-modal attention alignment, and contrastive learning for optimization. Extensive experiments on multiple large-scale PRVR benchmarks demonstrate state-of-the-art performance, with significant improvements in mean Average Precision (mAP) and Recall@K. The source code is publicly available.
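The linear-scalability claim rests on the recurrence underlying state-space models such as Mamba: each frame updates a fixed-size hidden state in constant time, so a T-frame video costs O(T) rather than the O(T²) of full self-attention. A minimal illustrative sketch of that scan is below — it uses fixed matrices `A`, `B`, `C`, whereas real Mamba makes these input-dependent (selective) and uses a hardware-aware parallel scan; the function name and shapes are assumptions, not the paper's code.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear-time state-space scan over a frame-feature sequence.

    Illustrative only: fixed (A, B, C), whereas Mamba's selective SSM
    derives them from the input at each step.
    x: (T, d_in) per-frame features; returns (T, d_out) outputs.
    """
    d_state = A.shape[0]
    h = np.zeros(d_state)      # hidden state carried across frames
    ys = []
    for x_t in x:              # one O(1) update per frame -> O(T) overall
        h = A @ h + B @ x_t    # state update
        ys.append(C @ h)       # readout
    return np.stack(ys)
```

Because the hidden state has fixed size, memory stays constant in sequence length — the property that makes long untrimmed videos tractable.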

📝 Abstract
Partially Relevant Video Retrieval (PRVR) is a challenging task in multimedia retrieval: it aims to identify and retrieve untrimmed videos that are only partially relevant to a given query. In this work, we investigate long-sequence video content understanding to address information redundancy. Leveraging the outstanding long-term state-space modeling capability and linear scalability of the Mamba module, we introduce MamFusion, a multi-Mamba framework with temporal fusion tailored to the PRVR task. This framework effectively captures state-relatedness in long-term video content and seamlessly integrates it into text-video relevance understanding, thereby enhancing the retrieval process. Specifically, we introduce Temporal T-to-V Fusion and Temporal V-to-T Fusion to explicitly model temporal relationships between text queries and video moments, improving contextual awareness and retrieval accuracy. Extensive experiments on large-scale datasets demonstrate that MamFusion achieves state-of-the-art retrieval performance. Code is available at: https://github.com/Vision-Multimodal-Lab-HZCU/MamFusion.
Problem

Research questions and friction points this paper is trying to address.

Addresses partially relevant video retrieval challenges
Leverages Mamba for long-term video content understanding
Improves retrieval accuracy with temporal fusion techniques
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Mamba module for long-term video content
Temporal T-to-V and V-to-T Fusion
State-relatedness integration in text-video relevance
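The T-to-V and V-to-T fusion bullets above can be pictured as a pair of cross-attention passes, one per direction: video moments attend over text tokens and vice versa, each stream then residually combined with what it attended to. The sketch below is a guess at that general shape, not the paper's implementation — `bidirectional_fusion`, the residual form, and the single-head attention are all assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context):
    """Single-head scaled dot-product attention from one modality
    into another; both inputs share feature dimension d."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ context

def bidirectional_fusion(text, video):
    """Hypothetical T-to-V / V-to-T fusion: each stream is enriched
    with context attended from the other, with a residual connection.
    text: (N_tokens, d), video: (N_moments, d)."""
    video_fused = video + cross_attend(video, text)  # T-to-V: text informs moments
    text_fused = text + cross_attend(text, video)    # V-to-T: moments inform tokens
    return text_fused, video_fused
```

Both outputs keep their original shapes, so the fused features can drop into any downstream text-video similarity head unchanged.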
Xinru Ying
School of Computer and Computing Science, Hangzhou City University, Hangzhou, China
Jiaqi Mo
College of Letters & Science, University of Wisconsin–Madison, WI, USA
Jingyang Lin
School of Computer and Computing Science, Hangzhou City University, Hangzhou, China
Canghong Jin
Hangzhou City University
Data Mining; Big Data
Fangfang Wang
Zhejiang University
Computer Vision; Machine Learning
Lina Wei
School of Computer and Computing Science, Hangzhou City University, Hangzhou, China; Zhejiang Provincial Engineering Research Center for Real-Time SmartTech in Urban Security Governance