SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection

📅 2025-11-13
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
RGB-D video salient object detection (RGB-D VSOD) faces three key bottlenecks when the Segment Anything Model (SAM) is adapted directly: heavy reliance on manual prompts, high memory overhead from sequential adapters, and the computational burden of memory attention for temporal modeling. To address these, we propose SAM-DAQ, a depth-guided adaptive query framework built on SAM2. Our core contributions are: (1) a parallel adapter-based multi-modal image encoder whose depth-guided parallel adapters, inserted via skip connections, use depth cues to fuse multi-modal features while the SAM2 encoder stays frozen; and (2) a query-driven temporal memory module that unifies the memory bank and prompt embeddings into a learnable pipeline of frame-level and video-level queries with an iterative update strategy. The framework operates fully prompt-free, and extensive experiments on three benchmark RGB-D VSOD datasets show state-of-the-art results across all evaluation metrics.
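To make the encoder-side idea concrete, below is a minimal PyTorch sketch of a depth-guided parallel adapter attached to a frozen encoder block through a skip connection. The class names, bottleneck width, and sigmoid gating are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch of a depth-guided parallel adapter (DPA).
# Names, dimensions, and the gating formulation are assumptions.
import torch
import torch.nn as nn


class DepthGuidedParallelAdapter(nn.Module):
    """Runs in parallel to a frozen encoder block and injects depth cues."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)        # project RGB tokens down
        self.depth_gate = nn.Linear(dim, bottleneck)  # depth cues modulate the bottleneck
        self.up = nn.Linear(bottleneck, dim)          # project back to encoder width
        self.act = nn.GELU()

    def forward(self, rgb_tokens: torch.Tensor, depth_tokens: torch.Tensor) -> torch.Tensor:
        # Depth features gate the low-rank RGB representation; the result is
        # added back to the main branch via a skip connection.
        hidden = self.act(self.down(rgb_tokens)) * torch.sigmoid(self.depth_gate(depth_tokens))
        return rgb_tokens + self.up(hidden)


class AdaptedEncoderBlock(nn.Module):
    """Frozen encoder block with a parallel adapter on a skip path."""

    def __init__(self, frozen_block: nn.Module, dim: int):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False  # only the adapter is fine-tuned
        self.adapter = DepthGuidedParallelAdapter(dim)

    def forward(self, rgb_tokens: torch.Tensor, depth_tokens: torch.Tensor) -> torch.Tensor:
        return self.block(rgb_tokens) + self.adapter(rgb_tokens, depth_tokens)
```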

📝 Abstract
Recently, the segment anything model (SAM) has attracted widespread attention and is often treated as a vision foundation model for universal segmentation. Some researchers have attempted to apply the foundation model directly to the RGB-D video salient object detection (RGB-D VSOD) task, which raises three challenges: the dependence on manual prompts, the high memory consumption of sequential adapters, and the computational burden of memory attention. To address these limitations, we propose a novel method, namely Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ), which adapts SAM2 to pop out salient objects from videos by seamlessly integrating depth and temporal cues within a unified framework. Firstly, we deploy a parallel adapter-based multi-modal image encoder (PAMIE), which incorporates several depth-guided parallel adapters (DPAs) in a skip-connection manner. Remarkably, we fine-tune the frozen SAM encoder under prompt-free conditions, where the DPAs utilize depth cues to facilitate the fusion of multi-modal features. Secondly, we deploy a query-driven temporal memory (QTM) module, which unifies the memory bank and prompt embeddings into a learnable pipeline. Concretely, by leveraging frame-level and video-level queries simultaneously, the QTM module can not only selectively extract temporally consistent features but also iteratively update the temporal representations of the queries. Extensive experiments are conducted on three RGB-D VSOD datasets, and the results show that the proposed SAM-DAQ consistently outperforms state-of-the-art methods in terms of all evaluation metrics.
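The query-driven temporal memory can be pictured with a small sketch along the following lines. The number of queries, the shared multi-head attention, and the momentum-style update are assumptions made for illustration; the paper's actual formulation may differ.

```python
# Minimal sketch of a query-driven temporal memory (QTM) module, assuming
# learnable frame-level and video-level queries that read the encoder
# features of each frame; the momentum update rule is an assumption.
from typing import Optional, Tuple

import torch
import torch.nn as nn


class QueryDrivenTemporalMemory(nn.Module):
    def __init__(self, dim: int = 256, n_frame_q: int = 8, n_video_q: int = 8,
                 momentum: float = 0.9):
        super().__init__()
        self.frame_queries = nn.Parameter(torch.randn(n_frame_q, dim))
        self.video_queries = nn.Parameter(torch.randn(n_video_q, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.momentum = momentum

    def forward(self, frame_feats: torch.Tensor,
                video_state: Optional[torch.Tensor] = None
                ) -> Tuple[torch.Tensor, torch.Tensor]:
        """frame_feats: (B, N, dim) tokens of the current frame."""
        b = frame_feats.size(0)
        if video_state is None:
            video_state = self.video_queries.unsqueeze(0).expand(b, -1, -1)

        # Frame-level and video-level queries jointly attend to the current frame.
        queries = torch.cat(
            [self.frame_queries.unsqueeze(0).expand(b, -1, -1), video_state], dim=1
        )
        read, _ = self.attn(queries, frame_feats, frame_feats)

        n_f = self.frame_queries.size(0)
        frame_out, video_read = read[:, :n_f], read[:, n_f:]

        # Iterative memory update: video-level queries are refreshed with the
        # newly read temporal evidence instead of storing dense feature maps.
        video_state = self.momentum * video_state + (1 - self.momentum) * video_read
        return frame_out, video_state
```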
Problem

Research questions and friction points this paper is trying to address.

Adapting the SAM foundation model to RGB-D video salient object detection
Eliminating dependence on manual prompts and reducing memory and computational overhead
Integrating depth and temporal cues in a unified, prompt-free framework
Innovation

Methods, ideas, or system contributions that make the work stand out.

Depth-guided parallel adapters use depth cues to fuse multi-modal features
Query-driven temporal memory selectively extracts temporally consistent features
Unified prompt-free framework integrates depth and temporal cues (composition sketched below)
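The sketch below illustrates how these pieces could compose per frame: depth-adapted features feed the temporal queries, and those queries stand in for manual prompt embeddings at the decoder. The `encoder`, `qtm`, and `mask_decoder` callables are hypothetical stand-ins (for example, the PAMIE-style encoder and QTM modules sketched above plus a decoder), not the real SAM2 API.

```python
# Illustrative composition only: all three callables are stand-ins that show
# learnable query outputs replacing manual prompts, not the actual SAM2 code.
def segment_video(frames_rgb, frames_depth, encoder, qtm, mask_decoder):
    """frames_*: lists of per-frame token tensors of shape (B, N, dim)."""
    masks, video_state = [], None
    for rgb, depth in zip(frames_rgb, frames_depth):
        feats = encoder(rgb, depth)                     # depth-adapted features
        frame_q, video_state = qtm(feats, video_state)  # temporal queries
        # The learnable queries act as the prompt embeddings, so no clicks,
        # boxes, or masks are supplied by a user.
        masks.append(mask_decoder(feats, prompt_embeddings=frame_q))
    return masks
```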
Jia Lin
Hangzhou Dianzi University, Hangzhou, China
Xiaofei Zhou
Shanghai Jiao Tong University
Jiyuan Liu
National University of Defense Technology
Runmin Cong
Shandong University, Jinan, China
Guodao Zhang
Hangzhou Dianzi University, Hangzhou, China
Zhi Liu
Shanghai University, Shanghai, China
Jiyong Zhang
Hangzhou Dianzi University, Hangzhou, China