MemorySAM: Memorize Modalities and Semantics with Segment Anything Model 2 for Multi-modal Semantic Segmentation

📅 2025-03-09
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses insufficient cross-modal semantic understanding in multi-modal semantic segmentation (MMSS). It pioneers the adaptation of SAM2's video memory mechanism to multi-modal image sequences, treating the modalities of a scene like frames of a video so that modality-agnostic features can be extracted. A Semantic Prototype Memory Module (SPMM), paired with a prototype alignment loss, explicitly models class-level semantic prototypes, facilitating SAM2's efficient transition from instance to semantic segmentation. The method combines zero-shot transfer learning with cross-modal sequential modeling. Evaluated on the DELIVER and MCubeS benchmarks, it achieves 65.38% and 52.88% mIoU, respectively, substantially surpassing state-of-the-art approaches. The framework establishes a scalable, memory-augmented paradigm for multi-modal segmentation, advancing both cross-modal representation learning and semantic coherence across heterogeneous input streams.

📝 Abstract
Research has focused on Multi-Modal Semantic Segmentation (MMSS), where pixel-wise predictions are derived from multiple visual modalities captured by diverse sensors. Recently, the large vision model Segment Anything Model 2 (SAM2) has shown strong zero-shot segmentation performance on both images and videos. When extending SAM2 to MMSS, two issues arise: (1) How can SAM2 be adapted to multi-modal data? (2) How can SAM2 better understand semantics? Inspired by cross-frame correlation in videos, we propose to treat multi-modal data as a sequence of frames representing the same scene. Our key idea is to "memorize" the modality-agnostic information and "memorize" the semantics related to the targeted scene. To achieve this, we apply SAM2's memory mechanisms across multi-modal data to capture modality-agnostic features. Meanwhile, to memorize semantic knowledge, we propose a training-only Semantic Prototype Memory Module (SPMM) that stores category-level prototypes throughout training, facilitating SAM2's transition from instance to semantic segmentation. A prototypical adaptation loss is imposed iteratively between global and local prototypes to align and refine SAM2's semantic understanding. Extensive experimental results demonstrate that the proposed MemorySAM outperforms SoTA methods by large margins on both synthetic and real-world benchmarks (65.38% mIoU on DELIVER, 52.88% mIoU on MCubeS). Source code will be made publicly available.
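To make the SPMM idea concrete, the sketch below shows one plausible reading of a class-level prototype memory: batch-local prototypes are masked averages of pixel embeddings per class, global prototypes are accumulated across training, and the prototypical adaptation loss penalizes the distance between the two. The EMA update rule, the squared-distance loss, and all names here are illustrative assumptions; the paper does not publish these details, only that prototypes are stored across training and aligned iteratively.

```python
import numpy as np


def local_prototypes(features, labels, num_classes):
    """Batch-local prototypes: masked mean of pixel embeddings per class.

    features: (N, D) pixel embeddings; labels: (N,) integer class ids.
    Returns {class_id: (D,) prototype} for classes present in the batch.
    """
    protos = {}
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = features[mask].mean(axis=0)
    return protos


class SemanticPrototypeMemory:
    """Training-only memory of global class prototypes.

    Assumption: global prototypes are refreshed with an exponential
    moving average (EMA) of each batch's local prototypes.
    """

    def __init__(self, num_classes, dim, momentum=0.99):
        self.protos = np.zeros((num_classes, dim))
        self.seen = np.zeros(num_classes, dtype=bool)
        self.momentum = momentum

    def update(self, local):
        for c, p in local.items():
            if self.seen[c]:
                # Blend the stored global prototype toward the batch prototype.
                self.protos[c] = (self.momentum * self.protos[c]
                                  + (1.0 - self.momentum) * p)
            else:
                # First observation of this class: copy the local prototype.
                self.protos[c] = p
                self.seen[c] = True

    def alignment_loss(self, local):
        """Prototypical adaptation loss (assumed form): mean squared
        distance between local and stored global prototypes."""
        diffs = [np.sum((self.protos[c] - p) ** 2)
                 for c, p in local.items() if self.seen[c]]
        return float(np.mean(diffs)) if diffs else 0.0
```

In a training loop, `update` and `alignment_loss` would be called once per batch, and the loss term added to the segmentation objective so SAM2's features are pulled toward stable class-level semantics.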
Problem

Research questions and friction points this paper is trying to address.

Adapt SAM2 for multi-modal data segmentation
Enhance SAM2's semantic understanding in segmentation
Develop memory mechanisms for modality-agnostic features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapt SAM2 for multi-modal data processing
Introduce Semantic Prototype Memory Module
Use prototypical adaptation loss for refinement