M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection

📅 2026-05-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses three limitations of SAM2 when it is applied to RGB-D video salient object detection (RGB-D VSOD): the limited spatial modeling capacity of linear LoRA, insufficient use of multi-scale features, and reliance on explicit prompts for initialization. To overcome these issues, the paper introduces three components: a modality-aware MoE-LoRA architecture for efficient multi-modal fine-tuning, an adaptive gating mechanism for hierarchical feature fusion, and a pseudo-guided memory initialization strategy that removes the need for manual prompts. By combining parameter-efficient fine-tuning, a mixture-of-experts structure, and memory-augmented learning, the proposed method reports state-of-the-art performance on three public RGB-D VSOD benchmarks.
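To make the idea of gated hierarchical fusion concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function names (`gated_fuse`, `fuse_pyramid`), the per-pixel sigmoid gate computed from concatenated channels, the nearest-neighbour upsampling, and the top-down fusion order are all assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def gated_fuse(fine, coarse, w_gate, b_gate=0.0):
    """Blend a fine map (C, H, W) with a coarser map (C, H/f, W/f).

    w_gate has shape (2C,) and acts like a 1x1 convolution that produces
    one gate value per pixel (hypothetical parameterisation).
    """
    up = upsample_nearest(coarse, fine.shape[1] // coarse.shape[1])
    stacked = np.concatenate([fine, up], axis=0)                      # (2C, H, W)
    gate = sigmoid(np.einsum("c,chw->hw", w_gate, stacked) + b_gate)  # (H, W)
    # gate near 1 keeps spatial detail; gate near 0 keeps semantic context.
    return gate * fine + (1.0 - gate) * up


def fuse_pyramid(levels, gate_weights):
    """Fuse levels ordered fine to coarse, starting from the coarsest."""
    fused = levels[-1]
    for fine, w in zip(reversed(levels[:-1]), reversed(gate_weights)):
        fused = gated_fuse(fine, fused, w)
    return fused


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    levels = [rng.normal(size=(4, s, s)) for s in (8, 4, 2)]
    weights = [rng.normal(size=8) for _ in range(2)]
    print(fuse_pyramid(levels, weights).shape)
```

With all gate weights at zero, every gate equals 0.5 and each step reduces to a plain average of the two levels, which is a convenient sanity check for the blending logic.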
📝 Abstract
The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to RGB-D video salient object detection (RGB-D VSOD) faces three challenges: the limited spatial modeling of linear LoRA, insufficient use of SAM2's multi-scale features, and the dependence of initialization on explicit prompts. To address these issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M$^4$-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. First, we inject Modality-Aware MoE-LoRA into SAM2's encoder; it employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning. Second, we deploy Gated Multi-Level Feature Fusion, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we use Pseudo-Guided Initialization, in which a coarse mask is treated as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M$^4$-SAM achieves state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.
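The Modality-Aware MoE-LoRA idea can be illustrated with a small NumPy sketch. This is a sketch under stated assumptions, not the authors' code: the class names (`ConvLoRAExpert`, `ModalityMoELoRA`), the depthwise 3x3 convolution in the low-rank space, the lookup-table dispatcher with one logit vector per modality, and the zero-initialised up-projection (the usual LoRA convention) are all hypothetical choices.

```python
import numpy as np


def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()


def depthwise_conv3x3(z, k):
    """Same-padded 3x3 depthwise cross-correlation. z: (r, H, W), k: (r, 3, 3)."""
    r, H, W = z.shape
    zp = np.pad(z, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(z)
    for i in range(3):
        for j in range(3):
            out += k[:, i, j][:, None, None] * zp[:, i:i + H, j:j + W]
    return out


class ConvLoRAExpert:
    """Low-rank adapter with a 3x3 convolution between down- and up-projection."""

    def __init__(self, channels, rank, rng):
        self.down = rng.normal(0.0, 0.02, (rank, channels))
        self.kernel = rng.normal(0.0, 0.02, (rank, 3, 3))
        self.up = np.zeros((channels, rank))  # zero init: adapter starts as a no-op

    def __call__(self, x):
        z = np.einsum("rc,chw->rhw", self.down, x)   # project to rank r
        z = depthwise_conv3x3(z, self.kernel)        # local spatial mixing
        return np.einsum("cr,rhw->chw", self.up, z)  # project back to C channels


class ModalityMoELoRA:
    """Adds a modality-weighted mixture of expert updates to a frozen layer output."""

    def __init__(self, channels, rank=4, num_experts=3, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = [ConvLoRAExpert(channels, rank, rng) for _ in range(num_experts)]
        # Hypothetical dispatcher: one learnable logit vector per modality.
        self.dispatch_logits = {
            "rgb": rng.normal(size=num_experts),
            "depth": rng.normal(size=num_experts),
        }

    def gates(self, modality):
        return softmax(self.dispatch_logits[modality])

    def __call__(self, x, frozen_out, modality):
        g = self.gates(modality)
        delta = sum(w * expert(x) for w, expert in zip(g, self.experts))
        return frozen_out + delta
```

Because the up-projection starts at zero, the adapted layer initially reproduces the frozen output exactly; only the adapter and dispatcher parameters would be trained, which is what keeps this style of fine-tuning parameter-efficient.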
Problem

Research questions and friction points this paper is trying to address.

RGB-D video salient object detection
SAM2
spatial modeling
multi-scale features
prompt initialization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-Aware MoE-LoRA
Gated Multi-Level Feature Fusion
Pseudo-Guided Initialization
Memory-Augmented SAM
RGB-D Video Salient Object Detection