AI Summary
Existing RGB-D video object segmentation (VOS) methods exploit cross-modal information insufficiently and suffer from target drift over long sequences. To address these issues, we propose a multi-store feature memory network that uses a hierarchical modality-selective fusion mechanism to adaptively align RGB and depth features and model their complementary cues. We further introduce the Segment Anything Model (SAM) into RGB-D VOS for the first time, designing hybrid spatio-temporal and modal prompts to guide segmentation refinement; SAM and the prompt engineering pipeline are jointly fine-tuned to enhance boundary precision. Our method achieves state-of-the-art performance on the latest RGB-D VOS benchmarks, significantly improving contour accuracy and temporal consistency in long videos. This work establishes a novel paradigm for cross-modal video understanding by unifying memory-augmented feature fusion with prompt-driven, multimodal segmentation refinement.
Abstract
RGB-Depth (RGB-D) Video Object Segmentation (VOS) aims to integrate the fine-grained texture information of the RGB modality with the spatial geometric cues of the depth modality to improve segmentation performance. However, off-the-shelf RGB-D segmentation methods fail to fully explore cross-modal information and suffer from object drift during long-term prediction. In this paper, we propose a novel RGB-D VOS method that uses multi-store feature memory for robust segmentation. Specifically, we design a hierarchical modality selection and fusion scheme that adaptively combines features from both modalities. Additionally, we develop a segmentation refinement module that uses the Segment Anything Model (SAM) to refine the segmentation mask, so that more reliable results are stored as memory to guide subsequent segmentation. By leveraging a spatio-temporal embedding and a modality embedding, mixed prompts and fused images are fed into SAM to unleash its potential in RGB-D VOS. Experimental results show that the proposed method achieves state-of-the-art performance on the latest RGB-D VOS benchmark.
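To make the idea of adaptively combining features from both modalities concrete, the following is a minimal sketch of a per-channel gated fusion in NumPy. It is not the paper's hierarchical modality selection and fusion module, whose details the abstract does not give. The function name `gated_fusion`, the gate parameters `w` and `b`, and the use of global average pooling are all assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, maps real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_fusion(rgb_feat, depth_feat, w, b):
    """Blend two (C, H, W) feature maps with a per-channel gate.

    Hypothetical illustration only: the gate is predicted from globally
    pooled descriptors of both modalities. w has shape (C, 2C), b has shape (C,).
    """
    # One descriptor per channel and modality via global average pooling.
    desc = np.concatenate(
        [rgb_feat.mean(axis=(1, 2)), depth_feat.mean(axis=(1, 2))]
    )  # shape (2C,)
    # Gate near 1 favors RGB, gate near 0 favors depth.
    gate = sigmoid(w @ desc + b)[:, None, None]  # shape (C, 1, 1)
    return gate * rgb_feat + (1.0 - gate) * depth_feat


# Toy inputs standing in for backbone features of one frame.
rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
rgb = rng.normal(size=(C, H, W))
depth = rng.normal(size=(C, H, W))
w = rng.normal(size=(C, 2 * C)) * 0.1
b = np.zeros(C)

fused = gated_fusion(rgb, depth, w, b)
```

With zero weights and bias the gate is 0.5 for every channel, so the output reduces to a plain average of the two modalities; a learned gate would instead shift weight toward whichever modality is more informative per channel.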