UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
SAM-family models lack explicit, continuous control over segmentation granularity; users rely on manual prompt engineering or post-hoc mask filtering, processes that are ambiguous and generalize poorly. Method: the authors propose the first annotation-free, arbitrary-granularity image segmentation framework, extending SAM-2 via self-supervised learning. The approach introduces a lightweight granularity-aware module (a 0.02% parameter increase) and a novel granularity-control embedding, coupled with an unsupervised partitioning strategy trained on only 6K unlabeled images, to enable fine-grained, continuous granularity modulation. The method supports interactive, whole-image, and video segmentation. Results: evaluated across 11 benchmarks, the method achieves substantial improvements: Number of Clicks to reach 90% IoU (NoC90) drops from 5.69 to 4.75; single-click IoU (1-IoU) rises from 58.0 to 73.1; and Average Recall at 1000 proposals (AR1000) rises from 49.6 to 68.3.
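The headline interactive-segmentation metrics above follow standard definitions: NoC90 is the average number of user clicks needed before the predicted mask first reaches 90% IoU with the ground truth, capped at a click budget, and 1-IoU is the IoU after a single click. A minimal sketch of how such an evaluator is typically computed (function names and the cap of 20 clicks are illustrative conventions, not taken from this paper):

```python
def noc_at_iou(iou_per_click, target_iou=0.90, max_clicks=20):
    """Number of Clicks (NoC) for one instance.

    iou_per_click: IoU of the predicted mask after click 1, 2, ...
    Returns the 1-based click count at which IoU first reaches the
    target, or max_clicks if it never does (the usual capping rule).
    """
    for i, iou in enumerate(iou_per_click, start=1):
        if iou >= target_iou:
            return min(i, max_clicks)
    return max_clicks


def mean_noc(all_traces, target_iou=0.90, max_clicks=20):
    """NoC90-style score: mean NoC over all evaluated instances."""
    return sum(noc_at_iou(t, target_iou, max_clicks)
               for t in all_traces) / len(all_traces)
```

Under this convention, lowering NoC90 from 5.69 to 4.75 means roughly one fewer correction click per object on average.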

📝 Abstract
The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually, by adding more prompts or selecting from pre-generated masks, to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02\%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
Problem

Research questions and friction points this paper is trying to address.

SAM models lack precise control over segmentation granularity levels
Manual refinement is ambiguous and requires expensive dense annotations
Existing methods cannot achieve continuous granularity control without supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised learning enables granularity control without annotations
Introduces granularity control embedding for continuous segmentation scale
Uses divide-and-conquer strategy with minimal parameters and unlabeled data
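One plausible way to realize a continuous granularity control embedding at negligible parameter cost is to map a scalar granularity value to an extra prompt token that a SAM-style mask decoder attends to. The sketch below is a hypothetical illustration under that assumption; the class name, MLP sizes, and token-concatenation scheme are my own, not the paper's actual design:

```python
import torch
import torch.nn as nn


class GranularityEmbedding(nn.Module):
    """Illustrative sketch (not the paper's implementation): embed a
    scalar granularity g in [0, 1] (0 = coarsest, 1 = finest) as one
    extra prompt token for a SAM-style mask decoder."""

    def __init__(self, embed_dim=256, hidden=64):
        super().__init__()
        # Tiny MLP: ~17K params at these sizes, negligible next to
        # a foundation-model backbone (consistent in spirit with the
        # paper's 0.02% parameter increase).
        self.mlp = nn.Sequential(
            nn.Linear(1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, embed_dim),
        )

    def forward(self, g, prompt_tokens):
        # g: (B,) granularity per image; prompt_tokens: (B, N, C)
        g_tok = self.mlp(g.unsqueeze(-1)).unsqueeze(1)  # (B, 1, C)
        # Append as one extra token so the decoder conditions on g.
        return torch.cat([prompt_tokens, g_tok], dim=1)  # (B, N+1, C)
```

Because g is continuous, sweeping it at inference time would smoothly interpolate between part-level and whole-object masks, which is the kind of control the paper describes.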