SOHES: Self-supervised Open-world Hierarchical Entity Segmentation

📅 2024-04-18

🏛️ International Conference on Learning Representations

📈 Citations: 3

✨ Influential: 1

career value

148K/year

🤖 AI Summary

Open-world instance segmentation suffers from heavy reliance on manual annotations and struggles to achieve category-agnostic hierarchical segmentation. This paper introduces the first end-to-end self-supervised framework capable of performing hierarchical segmentation of both whole entities and their constituent parts in open-world settings—using only raw images, without any human supervision. Methodologically, we propose a three-stage self-supervised paradigm: self-exploration → self-guidance → self-correction. It integrates self-supervised visual representation learning, feature-space clustering for pseudo-label generation, teacher-student mutual learning for noise correction, and hierarchical mask modeling. To our knowledge, this is the first work to achieve open-world hierarchical segmentation entirely without manual annotations. Our approach significantly outperforms supervised methods such as SAM, setting new state-of-the-art results in both zero-shot concept generalization and self-supervised segmentation performance.

Technology Category

Application Category

📝 Abstract

Open-world entity segmentation, as an emerging computer vision task, aims at segmenting entities in images without being restricted by pre-defined classes, offering impressive generalization capabilities on unseen images and concepts. Despite its promise, existing entity segmentation methods like Segment Anything Model (SAM) rely heavily on costly expert annotators. This work presents Self-supervised Open-world Hierarchical Entity Segmentation (SOHES), a novel approach that eliminates the need for human annotations. SOHES operates in three phases: self-exploration, self-instruction, and self-correction. Given a pre-trained self-supervised representation, we produce abundant high-quality pseudo-labels through visual feature clustering. Then, we train a segmentation model on the pseudo-labels, and rectify the noises in pseudo-labels via a teacher-student mutual-learning procedure. Beyond segmenting entities, SOHES also captures their constituent parts, providing a hierarchical understanding of visual entities. Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks. Project page: https://SOHES.github.io.

Problem

Research questions and friction points this paper is trying to address.

Segments entities without predefined classes or human annotations

Uses self-supervised clustering to generate pseudo-labels for training

Provides hierarchical segmentation of entities and their constituent parts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised hierarchical segmentation without human annotations

Generates pseudo-labels via visual feature clustering

Refines segmentation through teacher-student mutual learning

🔎 Similar Papers

Advancing Grounded Multimodal Named Entity Recognition via LLM-Based Reformulation and Box-Based Segmentation