Segment and Matte Anything in a Unified Model

📅 2026-01-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches struggle to balance zero-shot segmentation accuracy with interactive image matting, lacking a unified framework capable of simultaneously achieving high-quality segmentation and fine-grained alpha matte generation. This work proposes SAMA, a lightweight unified model that extends the Segment Anything Model (SAM) with a Multi-View Localization Encoder (MVLE), a Localization Adapter (Local-Adapter), and a dual-task prediction head, thereby integrating interactive segmentation and matting within a single architecture for the first time. Through a joint training strategy, SAMA significantly enhances boundary detail recovery with only a marginal increase in parameters, achieving state-of-the-art performance across multiple segmentation and matting benchmarks and demonstrating its efficiency and versatility for diverse downstream tasks.

📝 Abstract
Segment Anything (SAM) has recently pushed the boundaries of segmentation by demonstrating zero-shot generalization and flexible prompting after training on over one billion masks. Despite this, its mask prediction accuracy often falls short of the precision required in real-world applications. While several refinement modules have been proposed to boost SAM's segmentation quality, achieving highly accurate object delineation within a single, unified framework remains an open challenge. Furthermore, interactive image matting, which aims to generate fine-grained alpha mattes guided by diverse user hints, has not yet been explored in the context of SAM. Insights from recent studies highlight strong correlations between segmentation and matting, suggesting the feasibility of a unified model capable of both tasks. In this paper, we introduce Segment And Matte Anything (SAMA), a lightweight extension of SAM that delivers high-quality interactive image segmentation and matting with minimal extra parameters. Our Multi-View Localization Encoder (MVLE) captures detailed features from local views, while the Localization Adapter (Local-Adapter) refines mask outputs by recovering subtle boundary details. We also incorporate a dedicated prediction head for each task, allowing the architecture to generate segmentation and matting masks simultaneously. Trained on a diverse dataset aggregated from publicly available sources, SAMA achieves state-of-the-art performance across multiple segmentation and matting benchmarks, showcasing its adaptability and effectiveness in a wide range of downstream tasks.
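The data flow the abstract describes (a global encoder, local-view detail features from the MVLE, Local-Adapter fusion, and one head per task) can be sketched at a high level. The paper's actual implementation is not public here, so everything below is a hypothetical NumPy stand-in: every function body, the toy features, and the box-prompt format are illustrative assumptions, not SAMA's real components.

```python
import numpy as np

rng = np.random.default_rng(0)

def backbone(image):
    """Stand-in for SAM's frozen image encoder: a toy global feature map."""
    return image.mean(axis=-1, keepdims=True)  # (H, W, 1)

def multi_view_local_encoder(image, boxes):
    """Toy MVLE: encode cropped local views (here, per-pixel color contrast)
    and paste them back into a detail map that is zero elsewhere."""
    detail = np.zeros(image.shape[:2] + (1,), dtype=np.float32)
    for (y0, x0, y1, x1) in boxes:
        crop = image[y0:y1, x0:x1]
        detail[y0:y1, x0:x1, 0] = crop.std(axis=-1)
    return detail

def local_adapter(global_feat, detail_feat):
    """Toy Local-Adapter: fuse global context with local boundary detail."""
    return global_feat + 0.5 * detail_feat

def dual_heads(fused):
    """Dual-task head: a hard segmentation mask and a soft alpha matte
    are predicted from the same fused features."""
    logits = fused[..., 0] - fused[..., 0].mean()
    seg = (logits > 0).astype(np.float32)        # binary segmentation mask
    alpha = 1.0 / (1.0 + np.exp(-4.0 * logits))  # continuous matte in (0, 1)
    return seg, alpha

# Usage: a random image with one hypothetical prompt-derived local view.
image = rng.random((64, 64, 3)).astype(np.float32)
boxes = [(10, 10, 40, 40)]
fused = local_adapter(backbone(image), multi_view_local_encoder(image, boxes))
seg, alpha = dual_heads(fused)
```

The point of the sketch is the shared-trunk, two-head structure: segmentation and matting consume the same fused features, which is what makes the joint training strategy and the small parameter overhead plausible.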
Problem

Research questions and friction points this paper is trying to address.

image segmentation
image matting
interactive prompting
mask accuracy
unified model
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified segmentation and matting
Multi-View Localization Encoder
Localization Adapter
interactive image matting
lightweight extension