Towards Fine-grained Interactive Segmentation in Images and Videos

📅 2025-02-12

📈 Citations: 0

✨ Influential: 0

career value

219K/year

🤖 AI Summary

Current high-precision interactive segmentation methods face an inherent trade-off between local detail awareness and prompt robustness, limiting their practicality for fine-grained mask generation. This paper introduces the first general-purpose enhancement framework for fine-grained interactive segmentation, preserving SAM2’s generality while overcoming its limitations in local modeling. Our approach comprises three key innovations: (1) localization enhancement—leveraging cross-attention to model local contextual relationships; (2) prompt redirection—spatially aligning and remapping prompt embeddings; and (3) multi-scale mask refinement—employing cascaded feature fusion for progressive mask optimization. Evaluated on both image and video interactive segmentation benchmarks, our method consistently outperforms state-of-the-art approaches, achieving absolute mIoU gains of 3.2–5.7 percentage points. It enables real-time fine-grained editing and temporally consistent cross-frame segmentation, demonstrating significant advances in both accuracy and usability.

Technology Category

Application Category

📝 Abstract

The recent Segment Anything Models (SAMs) have emerged as foundational visual models for general interactive segmentation. Despite demonstrating robust generalization abilities, they still suffer performance degradations in scenarios demanding accurate masks. Existing methods for high-precision interactive segmentation face a trade-off between the ability to perceive intricate local details and maintaining stable prompting capability, which hinders the applicability and effectiveness of foundational segmentation models. To this end, we present an SAM2Refiner framework built upon the SAM2 backbone. This architecture allows SAM2 to generate fine-grained segmentation masks for both images and videos while preserving its inherent strengths. Specifically, we design a localization augment module, which incorporates local contextual cues to enhance global features via a cross-attention mechanism, thereby exploiting potential detailed patterns and maintaining semantic information. Moreover, to strengthen the prompting ability toward the enhanced object embedding, we introduce a prompt retargeting module to renew the embedding with spatially aligned prompt features. In addition, to obtain accurate high resolution segmentation masks, a mask refinement module is devised by employing a multi-scale cascaded structure to fuse mask features with hierarchical representations from the encoder. Extensive experiments demonstrate the effectiveness of our approach, revealing that the proposed method can produce highly precise masks for both images and videos, surpassing state-of-the-art methods.

Problem

Research questions and friction points this paper is trying to address.

Improving fine-grained segmentation accuracy in images and videos

Balancing local detail perception and prompting stability

Enhancing high-resolution mask generation in foundational segmentation models

Innovation

Methods, ideas, or system contributions that make the work stand out.

SAM2Refiner framework enhances segmentation

Localization augment module improves details

Prompt retargeting strengthens object embedding

🔎 Similar Papers

IDPro: Flexible Interactive Video Object Segmentation by ID-Queried Concurrent Propagation