M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three limitations of SAM2 when applied to RGB-D video salient object detection (RGB-D VSOD): the limited spatial modeling capacity of linear LoRA, insufficient use of multi-scale features, and reliance on explicit prompts for initialization. To overcome these issues, the paper introduces three components: a modality-aware MoE-LoRA architecture for efficient multi-modal fine-tuning, an adaptive gating mechanism for hierarchical feature fusion, and a pseudo-guided memory initialization strategy that removes the need for manual prompts. By combining parameter-efficient fine-tuning, a mixture-of-experts structure, and memory-augmented learning, the proposed method reports state-of-the-art performance on three public RGB-D VSOD benchmarks.
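To make the idea of adaptive gating for hierarchical feature fusion more concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the 1x1-conv gate, the nearest-neighbour upsampling, the top-down fusion order, and all names (`gated_fuse`, `fuse_pyramid`, `w_gate`, `b_gate`) are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def upsample_nearest(feat, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map by an integer factor."""
    return feat.repeat(factor, axis=1).repeat(factor, axis=2)

def gated_fuse(fine, coarse, w_gate, b_gate):
    """Fuse a fine (C, H, W) map with a coarse (C, H/2, W/2) map.

    w_gate (C, 2C) and b_gate (C,) act as a hypothetical 1x1 convolution
    that predicts a per-pixel, per-channel gate in [0, 1].
    """
    coarse_up = upsample_nearest(coarse, 2)
    stacked = np.concatenate([fine, coarse_up], axis=0)  # (2C, H, W)
    logits = np.einsum("oc,chw->ohw", w_gate, stacked) + b_gate[:, None, None]
    gate = sigmoid(logits)
    # gate -> 1 keeps spatial detail, gate -> 0 keeps semantic context
    return gate * fine + (1.0 - gate) * coarse_up

def fuse_pyramid(levels, params):
    """Fuse levels ordered fine -> coarse, starting from the coarsest.

    params[i] holds the (w_gate, b_gate) pair used when fusing levels[i]
    with the already-fused coarser result.
    """
    fused = levels[-1]
    for fine, (w, b) in zip(reversed(levels[:-1]), reversed(params)):
        fused = gated_fuse(fine, fused, w, b)
    return fused
```

With zero gate weights the gate is 0.5 everywhere, so the output is a plain average of the two levels; a learned gate would shift that balance per location.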

📝 Abstract

The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to the RGB-D video salient object detection (RGB-D VSOD) task faces three challenges: the limited spatial modeling of linear LoRA, insufficient use of SAM2's multi-scale features, and the dependence of initialization on explicit prompts. To address these issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M$^4$-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. First, we inject Modality-Aware MoE-LoRA into SAM2's encoder; it employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning. Second, we deploy Gated Multi-Level Feature Fusion, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we use Pseudo-Guided Initialization, in which a coarse mask is treated as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M$^4$-SAM achieves state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.
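The following NumPy sketch illustrates one way a LoRA branch with convolutional experts and modality-dependent routing could look. It is a guess at the general shape of the idea, not the authors' design: the depthwise 3x3 convolution in the low-rank space, the per-modality router, top-k selection, the rank, the expert count, and every class and parameter name are assumptions introduced for illustration.

```python
import numpy as np

def depthwise_conv3x3(x, kernel):
    """Zero-padded, stride-1 depthwise conv. x: (r, H, W), kernel: (r, 3, 3)."""
    _, H, W = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += kernel[:, dy, dx][:, None, None] * padded[:, dy:dy + H, dx:dx + W]
    return out

class ConvLoRAExpert:
    """Low-rank adapter: project down, mix locally with a 3x3 conv, project up."""
    def __init__(self, dim, rank, rng):
        self.down = rng.normal(0.0, 0.02, (rank, dim))
        self.kernel = rng.normal(0.0, 0.02, (rank, 3, 3))
        self.up = np.zeros((dim, rank))  # zero init, so the adapter starts as a no-op

    def __call__(self, x):  # x: (dim, H, W)
        z = np.einsum("rd,dhw->rhw", self.down, x)
        z = depthwise_conv3x3(z, self.kernel)
        return np.einsum("dr,rhw->dhw", self.up, z)

class ModalityAwareMoELoRA:
    """Hypothetical dispatcher: one router per modality over shared experts."""
    def __init__(self, dim, rank=4, num_experts=4, top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.experts = [ConvLoRAExpert(dim, rank, rng) for _ in range(num_experts)]
        self.router = {m: rng.normal(0.0, 0.02, (num_experts, dim))
                       for m in ("rgb", "depth")}
        self.top_k = top_k

    def __call__(self, x, modality):
        # Route on the globally pooled feature of this modality
        logits = self.router[modality] @ x.mean(axis=(1, 2))
        top = np.argsort(logits)[-self.top_k:]
        weights = np.exp(logits[top] - logits[top].max())
        weights /= weights.sum()
        # Residual update; the caller adds it to the frozen layer's output
        return sum(w * self.experts[i](x) for w, i in zip(weights, top))
```

In this sketch the frozen SAM2 weights are untouched; only the experts and routers would be trained, and the returned tensor is added to the output of the layer being adapted.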

Problem

Research questions and friction points this paper is trying to address.

RGB-D video salient object detection

SAM2

spatial modeling

multi-scale features

prompt initialization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modality-Aware MoE-LoRA

Gated Multi-Level Feature Fusion

Pseudo-Guided Initialization

Memory-Augmented SAM

RGB-D Video Salient Object Detection
