GrabS: Generative Embodied Agent for 3D Object Segmentation without Scene Supervision

Date: 2025-04-16
Citations: 0
Influential: 0
AI Summary
Problem: Unsupervised 3D point cloud object segmentation struggles with complex indoor scenes and often relies on pre-trained 2D features or motion priors, limiting generalizability and autonomy. Method: This paper introduces the first generative embodied agent framework for unsupervised 3D object segmentation. It operates in two stages: (i) joint learning of generative and discriminative object-centric priors on object-level datasets; and (ii) active querying of these priors by an embodied agent to enable collaborative multi-object discovery and segmentation. Contribution/Results: The approach eliminates dependence on 2D features or motion cues, instead introducing object-centric generative modeling, embodied query strategies, contrastive prior alignment, and unsupervised clustering optimization. Evaluated on two real-world datasets and a newly constructed synthetic dataset, it achieves a >12.6% improvement in mean Average Precision (mAP) over state-of-the-art unsupervised methods, marking the first demonstration of fine-grained, robust multi-object segmentation in complex indoor scenes without supervision.

Abstract
We study the hard problem of 3D object segmentation in complex point clouds without requiring human labels of 3D scenes for supervision. By relying on the similarity of pretrained 2D features or external signals such as motion to group 3D points as objects, existing unsupervised methods are usually limited to identifying simple objects like cars or their segmented objects are often inferior due to the lack of objectness in pretrained features. In this paper, we propose a new two-stage pipeline called GrabS. The core concept of our method is to learn generative and discriminative object-centric priors as a foundation from object datasets in the first stage, and then design an embodied agent to learn to discover multiple objects by querying against the pretrained generative priors in the second stage. We extensively evaluate our method on two real-world datasets and a newly created synthetic dataset, demonstrating remarkable segmentation performance, clearly surpassing all existing unsupervised methods.
Problem

Research questions and friction points this paper is trying to address.

Unsupervised 3D object segmentation in complex point clouds
Overcoming limitations of pretrained 2D features for object grouping
Learning generative and discriminative priors for object discovery
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage pipeline for unsupervised 3D segmentation
Generative and discriminative object-centric priors
Embodied agent queries pretrained generative priors
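
The page gives no implementation details, so the following is only a toy sketch of the general idea of "querying a prior to discover objects". Everything in it is an assumption made for illustration: a fixed unit-sphere template stands in for the learned generative prior, a fixed list of candidate centers stands in for the embodied agent's learned behavior, and Chamfer distance stands in for whatever scoring the actual method uses. All function names are hypothetical and none of this is taken from the GrabS code.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets of shape (N, 3) and (M, 3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def sphere_template(n=200, seed=0):
    """Hypothetical stand-in for a learned generative prior: points on a unit sphere."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=(n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def query_prior(points, template):
    """Score a candidate crop against the prior (lower means more object-like)."""
    centered = points - points.mean(axis=0)
    scale = np.linalg.norm(centered, axis=1).max()
    if scale == 0:
        return np.inf
    return chamfer(centered / scale, template)

def discover(scene, centers, radius, template, threshold, min_points=20):
    """Toy agent: visit each candidate center, crop a ball, query the prior,
    and assign a new object id to crops the prior accepts (-1 = unassigned)."""
    labels = np.full(len(scene), -1)
    next_id = 0
    for c in centers:
        mask = (np.linalg.norm(scene - c, axis=1) <= radius) & (labels == -1)
        if mask.sum() < min_points:
            continue
        if query_prior(scene[mask], template) < threshold:
            labels[mask] = next_id
            next_id += 1
    return labels

def make_toy_scene(seed=1, n=300):
    """Two spheres (objects) plus one flat square patch (clutter)."""
    rng = np.random.default_rng(seed)

    def sphere(center, r):
        v = rng.normal(size=(n, 3))
        v /= np.linalg.norm(v, axis=1, keepdims=True)
        return np.asarray(center, dtype=float) + r * v

    patch = np.column_stack([
        rng.uniform(-0.5, 0.5, n),
        3.0 + rng.uniform(-0.5, 0.5, n),
        np.zeros(n),
    ])
    scene = np.vstack([sphere([0, 0, 0], 0.5), patch, sphere([3, 0, 0], 0.5)])
    centers = np.array([[0, 0, 0], [0, 3, 0], [3, 0, 0]], dtype=float)
    return scene, centers

if __name__ == "__main__":
    scene, centers = make_toy_scene()
    labels = discover(scene, centers, radius=0.8,
                      template=sphere_template(), threshold=0.5)
    print(np.unique(labels))
```

In this toy setup the flat patch scores poorly against the sphere template and stays unlabeled, while each sphere receives its own id. A learned prior and a learned query policy would replace the template and the fixed center list.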
Authors
Zihui Zhang
The Hong Kong Polytechnic University
3D Vision
Yafei Yang
Shenzhen Research Institute, The Hong Kong Polytechnic University; vLAR Group, The Hong Kong Polytechnic University
Hongtao Wen
PhD student, The Hong Kong Polytechnic University
Robotics, Computer Vision, Deep Learning
Bo Yang
Shenzhen Research Institute, The Hong Kong Polytechnic University; vLAR Group, The Hong Kong Polytechnic University