RGB-D Video Object Segmentation via Enhanced Multi-store Feature Memory

📅 2024-05-30

🏛️ International Conference on Multimedia Retrieval

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Existing RGB-D video object segmentation (VOS) methods suffer from insufficient cross-modal information exploitation and long-term target drift. To address these issues, we propose a multi-store feature memory network that employs a hierarchical modality selection and fusion mechanism to adaptively combine complementary RGB and depth features. We further introduce the Segment Anything Model (SAM) into RGB-D VOS for the first time, designing mixed prompts built from spatio-temporal and modality embeddings to guide segmentation refinement; SAM and the prompt engineering pipeline are jointly fine-tuned to enhance boundary precision. Our method achieves state-of-the-art performance on the latest RGB-D VOS benchmark, improving both temporal consistency in long videos and contour accuracy. This work unifies memory-augmented feature fusion with prompt-driven, multimodal segmentation refinement for cross-modal video understanding.

📝 Abstract

RGB-Depth (RGB-D) Video Object Segmentation (VOS) aims to integrate the fine-grained texture information of RGB with the spatial geometric cues of the depth modality to improve segmentation performance. However, off-the-shelf RGB-D segmentation methods fail to fully exploit cross-modal information and suffer from object drift during long-term prediction. In this paper, we propose a novel RGB-D VOS method based on multi-store feature memory for robust segmentation. Specifically, we design a hierarchical modality selection and fusion scheme that adaptively combines features from both modalities. Additionally, we develop a segmentation refinement module that uses the Segment Anything Model (SAM) to refine the segmentation mask, so that more reliable results are stored as memory to guide subsequent segmentation. By leveraging spatio-temporal embedding and modality embedding, mixed prompts and fused images are fed into SAM to unleash its potential in RGB-D VOS. Experimental results show that the proposed method achieves state-of-the-art performance on the latest RGB-D VOS benchmark.

Problem

Research questions and friction points this paper is trying to address.

Enhancing RGB-D video object segmentation via cross-modal fusion

Addressing object drift in long-term RGB-D segmentation predictions

Refining segmentation masks using SAM for improved memory guidance

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical modality selection and fusion

Segmentation refinement using SAM

Spatio-temporal and modality embedding
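
To make the idea of adaptively combining RGB and depth features more concrete, here is a minimal sketch of one generic way such a fusion can be done: a per-channel softmax gate over the two modalities, applied independently at each feature level. This is an illustration only, not the paper's actual module. The function names (`channel_means`, `modality_weights`, `fuse`, `fuse_hierarchy`), the pooled-activation gate, and the toy feature layout (a list of channels, each a flat list of floats) are all assumptions made for this example.

```python
import math


def channel_means(feat):
    """Global average pooling: one mean per channel."""
    return [sum(ch) / len(ch) for ch in feat]


def modality_weights(rgb_feat, depth_feat):
    """Per-channel softmax over the pooled RGB and depth activations.

    Hypothetical gate: a real model would learn this mapping.
    """
    weights = []
    for r, d in zip(channel_means(rgb_feat), channel_means(depth_feat)):
        m = max(r, d)  # subtract the max for numerical stability
        er, ed = math.exp(r - m), math.exp(d - m)
        weights.append((er / (er + ed), ed / (er + ed)))
    return weights


def fuse(rgb_feat, depth_feat):
    """Weighted per-channel sum of RGB and depth features at one level."""
    fused = []
    gates = modality_weights(rgb_feat, depth_feat)
    for (wr, wd), r_ch, d_ch in zip(gates, rgb_feat, depth_feat):
        fused.append([wr * r + wd * d for r, d in zip(r_ch, d_ch)])
    return fused


def fuse_hierarchy(rgb_levels, depth_levels):
    """Apply the same gated fusion independently at every feature level."""
    return [fuse(r, d) for r, d in zip(rgb_levels, depth_levels)]


# Toy input: one level, two channels, two spatial positions per channel.
rgb = [[1.0, 1.0], [0.0, 0.0]]
depth = [[1.0, 1.0], [4.0, 4.0]]
fused_levels = fuse_hierarchy([rgb], [depth])
```

In this toy input the first channel has equal pooled activations in both modalities, so the gate splits evenly, while the second channel leans strongly toward depth because its pooled depth activation is much larger.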
