M2N2V2: Multi-Modal Unsupervised and Training-free Interactive Segmentation

📅 2025-03-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the mIoU degradation in unsupervised, training-free point-based interactive segmentation caused by inconsistent segmentation size estimation. We propose a depth-guided Markov mapping framework coupled with sequential prompt modeling. By fusing RGB and depth modalities—where depth maps are generated via Depth Anything V2—we construct depth-aware pixel affinities through attention mechanisms and nearest-neighbor propagation. An adaptive scoring function is further introduced to dynamically suppress size jitter. To our knowledge, this is the first unsupervised approach to formulate Markov state transitions with explicit depth guidance. Evaluated on DAVIS and HQSeg44K, our method achieves significantly lower Number-of-Clicks (NoC) than SAM and SimpleClick, and outperforms M2N2 in both mIoU and NoC across all domains except medical imaging. Moreover, it substantially narrows the performance gap with supervised methods.
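The depth-aware affinity construction described above can be illustrated as a simple fusion of an attention-based affinity map with a depth-similarity term. This is a minimal sketch, not the paper's formulation: the Gaussian similarity, the convex combination, and the hyper-parameters `sigma` and `alpha` are all illustrative assumptions.

```python
import numpy as np

def depth_guided_affinity(attn, depth, prompt_yx, sigma=0.1, alpha=0.5):
    """Fuse attention affinity with depth similarity (illustrative sketch).

    attn      : (H, W) attention-based affinity to the prompt point, in [0, 1]
    depth     : (H, W) normalized monocular depth map (e.g. Depth Anything V2)
    prompt_yx : (row, col) of the user's click
    sigma, alpha : hypothetical hyper-parameters, not values from the paper
    """
    y, x = prompt_yx
    # Gaussian similarity between each pixel's depth and the clicked depth:
    # pixels at a similar depth as the prompt get values close to 1.
    depth_sim = np.exp(-((depth - depth[y, x]) ** 2) / (2 * sigma ** 2))
    # Convex combination of the RGB (attention) and depth modalities.
    return alpha * attn + (1 - alpha) * depth_sim
```

Pixels on the same depth plane as the click are boosted, which is one plausible way depth guidance can stabilize the resulting affinity map before Markov-map propagation.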

📝 Abstract
We present Markov Map Nearest Neighbor V2 (M2N2V2), a simple yet effective approach that leverages depth guidance and attention maps for unsupervised, training-free point-prompt-based interactive segmentation. Following recent trends in supervised multimodal approaches, we carefully integrate depth as an additional modality to create novel depth-guided Markov maps. Furthermore, we observe occasional segment size fluctuations in M2N2 during the interactive process, which can decrease the overall mIoU. To mitigate this problem, we model prompting as a sequential process and propose a novel adaptive score function that considers the previous segmentation and the current prompt point to prevent unreasonable segment size changes. Using Stable Diffusion 2 and Depth Anything V2 as backbones, we empirically show that our proposed M2N2V2 significantly improves the Number of Clicks (NoC) and mIoU compared to M2N2 on all datasets except those from the medical domain. Interestingly, our unsupervised approach achieves NoC results competitive with supervised methods such as SAM and SimpleClick on the more challenging DAVIS and HQSeg44K datasets, reducing the gap between supervised and unsupervised methods.
Problem

Research questions and friction points this paper is trying to address.

Point-prompt-based interactive segmentation without supervision or training lags behind supervised methods.
Segment size fluctuations during the interactive process degrade overall mIoU.
Depth is underexploited as a guiding modality in unsupervised segmentation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Depth-guided Markov-maps for segmentation
Adaptive score function prevents size fluctuations
Unsupervised approach competes with supervised methods
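The adaptive score function is described as comparing the previous segmentation against the current candidate to suppress unreasonable size changes. One minimal way to realize that idea is a log-symmetric area-ratio penalty; the helper below is a hypothetical sketch, and `lam` and the penalty form are assumptions, not the paper's exact formulation.

```python
import numpy as np

def adaptive_score(candidate_mask, prev_mask, base_score, lam=0.5):
    """Down-weight candidates whose area deviates strongly from the
    previous interaction step (sketch; `lam` and the ratio penalty are
    assumptions, not the paper's exact score function).

    candidate_mask, prev_mask : binary (H, W) arrays
    base_score                : quality score of the candidate segmentation
    """
    prev_area = prev_mask.sum()
    cand_area = candidate_mask.sum()
    if prev_area == 0:
        return base_score  # first click: no size prior yet
    ratio = cand_area / prev_area
    # Penalty grows as the size ratio moves away from 1; the log makes
    # doubling and halving the segment equally costly.
    penalty = lam * abs(np.log(max(ratio, 1e-6)))
    return base_score - penalty
```

With such a penalty, a candidate that keeps the segment size stable retains its base score, while a candidate that halves or doubles the segment is scored lower, which matches the stated goal of suppressing size jitter across clicks.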
👥 Authors
Markus Karmann (Vivo Tech Research GmbH)
Peng-Tao Jiang (Researcher, vivo; topics: Diffusion Models, Dense Predictions, Visual Attention)
Bo Li (vivo Mobile Communication Co., Ltd., Shanghai, China)
Onay Urfalioglu (unknown affiliation)