RGB-D Video Object Segmentation via Enhanced Multi-store Feature Memory

📅 2024-05-30
🏛️ International Conference on Multimedia Retrieval
📈 Citations: 1
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing RGB-D video object segmentation (VOS) methods suffer from insufficient cross-modal information exploitation and long-term target drift. To address these issues, we propose a Multi-store Feature Memory Network that employs a hierarchical modality selection and fusion mechanism to achieve adaptive alignment and complementary modeling of RGB and depth features. We further introduce the Segment Anything Model (SAM) into RGB-D VOS for the first time, designing spatio-temporal–modal hybrid prompts to guide segmentation refinement; SAM and the prompt engineering pipeline are jointly fine-tuned to enhance boundary precision. Our method achieves state-of-the-art performance on the latest RGB-D VOS benchmarks, significantly improving temporal consistency in long videos as well as contour accuracy. This work establishes a novel paradigm for cross-modal video understanding by unifying memory-augmented feature fusion with prompt-driven, multimodal segmentation refinement.

๐Ÿ“ Abstract
The RGB-Depth (RGB-D) Video Object Segmentation (VOS) aims to integrate the fine-grained texture information of RGB with the spatial geometric clues of depth modality, boosting the performance of segmentation. However, off-the-shelf RGB-D segmentation methods fail to fully explore cross-modal information and suffer from object drift during long-term prediction. In this paper, we propose a novel RGB-D VOS method via multi-store feature memory for robust segmentation. Specifically, we design the hierarchical modality selection and fusion, which adaptively combines features from both modalities. Additionally, we develop a segmentation refinement module that effectively utilizes the Segmentation Anything Model (SAM) to refine the segmentation mask, ensuring more reliable results as memory to guide subsequent segmentation tasks. By leveraging spatio-temporal embedding and modality embedding, mixed prompts and fused images are fed into SAM to unleash its potential in RGB-D VOS. Experimental results show that the proposed method achieves state-of-the-art performance on the latest RGB-D VOS benchmark.
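The "adaptively combines features from both modalities" idea can be illustrated with a per-channel gating scheme. The sketch below is a hypothetical minimal NumPy example, not the paper's actual architecture: `gated_modality_fusion`, `W`, and `b` are illustrative names, and the gate here is a simple learned projection rather than the paper's hierarchical selection mechanism.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_modality_fusion(rgb, depth, W, b):
    """Per-channel gated fusion of RGB and depth feature vectors.

    rgb, depth: (C,) feature vectors at one spatial location.
    W: (C, 2C) gate projection, b: (C,) bias (illustrative parameters;
    in a trained model these would be learned).
    """
    z = np.concatenate([rgb, depth])    # (2C,) joint descriptor
    g = sigmoid(W @ z + b)              # (C,) gate values in (0, 1)
    return g * rgb + (1.0 - g) * depth  # convex per-channel combination

# With a zero gate projection the gate is 0.5 everywhere, so the fused
# feature is the plain average of the two modality features.
C = 4
rgb = np.array([1.0, 2.0, 3.0, 4.0])
depth = np.zeros(C)
fused = gated_modality_fusion(rgb, depth, np.zeros((C, 2 * C)), np.zeros(C))
```

A gate in (0, 1) lets the network lean on depth where RGB texture is ambiguous (and vice versa) without discarding either modality outright.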
Problem

Research questions and friction points this paper is trying to address.

Enhancing RGB-D video object segmentation via cross-modal fusion
Addressing object drift in long-term RGB-D segmentation predictions
Refining segmentation masks using SAM for improved memory guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical modality selection and fusion
Segmentation refinement using SAM
Spatio-temporal and modality embedding
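The memory-guidance loop described above (refined masks stored as memory to steer later frames) can be sketched as a key-value feature store with attention-style reads. This is a hedged, minimal sketch: the `FeatureMemory` class, its FIFO eviction, and the cosine-similarity read are assumptions for illustration, not the paper's multi-store design.

```python
import numpy as np

class FeatureMemory:
    """Fixed-capacity key-value feature memory with attention-style reads.

    Illustrative stand-in for a memory store: keys index past frames,
    values hold their (refined) feature representations.
    """

    def __init__(self, capacity):
        self.capacity = capacity
        self.keys, self.values = [], []

    def write(self, key, value):
        # Evict the oldest entry once capacity is reached (FIFO policy).
        if len(self.keys) >= self.capacity:
            self.keys.pop(0)
            self.values.pop(0)
        self.keys.append(key)
        self.values.append(value)

    def read(self, query, temperature=0.1):
        # Softmax over cosine similarities between the query and stored
        # keys, then a similarity-weighted readout of the stored values.
        K = np.stack(self.keys)                             # (N, C)
        sims = K @ query / (
            np.linalg.norm(K, axis=1) * np.linalg.norm(query) + 1e-8)
        w = np.exp(sims / temperature)
        w /= w.sum()
        return w @ np.stack(self.values)                    # (C,) readout

# Usage: store two frames' features, then query with a key matching the
# first one; a sharp temperature makes the readout track that entry.
mem = FeatureMemory(capacity=8)
mem.write(np.array([1.0, 0.0]), np.array([10.0, 0.0]))
mem.write(np.array([0.0, 1.0]), np.array([0.0, 10.0]))
out = mem.read(np.array([1.0, 0.0]), temperature=0.05)
```

Writing only SAM-refined features back into such a store is one plausible way the "more reliable results as memory" guidance would curb drift: later reads are anchored to high-quality past targets rather than to accumulated prediction errors.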