S2-UniSeg: Fast Universal Agglomerative Pooling for Scalable Segment Anything without Supervision

📅 2025-08-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing self-supervised image segmentation methods rely on multi-stage training with offline pseudo-mask generation, resulting in poor scalability and discontinuous optimization. This paper proposes S2-UniSeg, an end-to-end trainable universal segmentation framework. It introduces UniAP, a millisecond-level pseudo-mask generation algorithm, and combines query-wise self-distillation with a momentum-based teacher-student architecture to jointly model semantic and instance segmentation at multiple granularities on SA-1B. By eliminating the costly offline step, S2-UniSeg enables continuous optimization and efficient large-scale training. Extensive experiments demonstrate consistent superiority over UnSAM: +6.9 AP on COCO, +11.1 AR on UVO, +4.5 Pixel Accuracy on COCOStuff-27, and +8.0 RQ on Cityscapes. Performance further improves after scaling up to a 2M-image subset of SA-1B.

📝 Abstract
Recent self-supervised image segmentation models have achieved promising performance on semantic segmentation and class-agnostic instance segmentation. However, their pretraining schedule is multi-stage, requiring a time-consuming pseudo-mask generation process between training epochs. This offline process not only makes it difficult to scale with training dataset size, but also leads to sub-optimal solutions due to its discontinuous optimization routine. To solve these issues, we first present a novel pseudo-mask algorithm, Fast Universal Agglomerative Pooling (UniAP). Each layer of UniAP can identify groups of similar nodes in parallel, allowing it to generate semantic-level, instance-level, and multi-granular pseudo-masks within tens of milliseconds per image. Based on the fast UniAP, we propose Scalable Self-Supervised Universal Segmentation (S2-UniSeg), which employs a student and a momentum teacher for continuous pretraining. A novel segmentation-oriented pretext task, Query-wise Self-Distillation (QuerySD), is proposed to pretrain S2-UniSeg to learn local-to-global correspondences. Under the same setting, S2-UniSeg outperforms the SOTA UnSAM model, achieving notable improvements of +6.9 AP on COCO, +11.1 AR on UVO, +4.5 PixelAcc on COCOStuff-27, and +8.0 RQ on Cityscapes. After scaling up to a larger 2M-image subset of SA-1B, S2-UniSeg achieves further performance gains on all four benchmarks. Our code and pretrained models are available at https://github.com/bio-mlhui/S2-UniSeg
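The abstract's core idea, a pooling layer that "identifies groups of similar nodes", can be illustrated with a minimal sketch. This is not the paper's UniAP implementation; it is a generic agglomerative-grouping step under assumed choices (cosine similarity, a fixed merge threshold, union-find for transitive merging, mean pooling per group):

```python
# Illustrative sketch, NOT the paper's UniAP algorithm: one agglomerative
# layer that merges feature "nodes" whose cosine similarity exceeds a
# threshold, then mean-pools the features of each resulting group.
import numpy as np

def agglomerative_pool(feats: np.ndarray, thresh: float = 0.9):
    """feats: (N, D) node features. Returns (group_ids, pooled_feats)."""
    n = feats.shape[0]
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T  # pairwise cosine similarity

    # Union-find so that similarity merges are transitive.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > thresh:
                parent[find(i)] = find(j)

    roots = [find(i) for i in range(n)]
    uniq = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    group_ids = np.array([uniq[r] for r in roots])
    pooled = np.stack([feats[group_ids == g].mean(axis=0)
                       for g in range(len(uniq))])
    return group_ids, pooled
```

Stacking such layers with progressively looser thresholds would yield coarser groups at each level, which is one way masks at multiple granularities could emerge; the actual UniAP layer is parallelized and differs in its grouping rule.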
Problem

Research questions and friction points this paper is trying to address.

Eliminates the time-consuming offline pseudo-mask generation stage in self-supervised segmentation
Enables scalable training without discontinuous optimization routines
Generates multi-granular segmentation masks within milliseconds per image
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fast Universal Agglomerative Pooling algorithm
Student-teacher continuous pretraining architecture
Query-wise Self-Distillation pretext task
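The momentum teacher in the student-teacher pretraining setup is typically maintained with an exponential moving average of the student's weights (as in BYOL/DINO-style self-distillation). A minimal sketch, with the momentum coefficient 0.999 an assumed value rather than one taken from the paper:

```python
# Sketch of a standard EMA momentum-teacher update (assumption: S2-UniSeg
# follows the common BYOL/DINO-style scheme; the coefficient is illustrative).
def update_teacher(teacher_params, student_params, momentum=0.999):
    """EMA update: teacher <- momentum * teacher + (1 - momentum) * student."""
    return [t * momentum + s * (1.0 - momentum)
            for t, s in zip(teacher_params, student_params)]
```

Because the teacher changes slowly, it provides stable targets for the student's query-wise self-distillation loss throughout continuous pretraining.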
Huihui Xu
Shanghai Artificial Intelligence Laboratory
Jin Ye
Shanghai Artificial Intelligence Laboratory
Hongqiu Wang
Hong Kong University of Science and Technology (Guangzhou)
AI for healthcare, Label-efficient learning, Multi-modal learning, Fairness, MLLM
Changkai Ji
Shanghai Artificial Intelligence Laboratory
Jiashi Lin
Shanghai Artificial Intelligence Laboratory
Ming Hu
Shanghai Artificial Intelligence Laboratory
Ziyan Huang
Shanghai Artificial Intelligence Laboratory
Ying Chen
Shanghai Artificial Intelligence Laboratory
Chenglong Ma
Fudan University; Shanghai Innovation Institute
multi-modal models, generative models, medical image analysis
Tianbin Li
Shanghai Artificial Intelligence Laboratory
Machine Learning, Computer Vision, General Intelligence
Lihao Liu
Amazon
LLM-based Agent, Healthcare AI
Junjun He
Shanghai Jiao Tong University
Lei Zhu
The Hong Kong University of Science and Technology