Clutt3R-Seg: Sparse-view 3D Instance Segmentation for Language-grounded Grasping in Cluttered Scenes

📅 2026-02-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of unreliable 3D instance segmentation in cluttered scenes, where occlusion, sparse viewpoints, and noisy masks hinder language-guided robotic grasping, by proposing a zero-shot 3D instance segmentation method. It uniquely treats noisy masks as informative cues for constructing a semantics-driven hierarchical instance tree. By integrating cross-view grouping, a conditional replacement strategy, and a consistency-aware update mechanism, the approach maintains robust instance correspondences from only a single post-interaction image. The method further incorporates open-vocabulary semantic embeddings to enable natural language-guided object selection. Experiments demonstrate its superiority in highly cluttered environments: it achieves an AP@25 of 61.66, more than 2.2 times higher than existing methods, and with only four input views it surpasses MaskClustering's eight-view performance.
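The cross-view grouping with conditional replacement described above can be sketched as a greedy merge over per-view masks. This is an illustrative toy, not the paper's implementation: masks are modeled as sets of back-projected 3D point indices, and the `InstanceNode` class, `group_masks` function, and the 0.25 merge threshold are all assumed names and values.

```python
# Hypothetical sketch of cross-view mask grouping into instance nodes.
# A node's per-view child masks play the role of leaves in a (flat)
# instance tree; thresholds and names are illustrative only.

def iou(a, b):
    """Intersection-over-union of two point-index sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

class InstanceNode:
    def __init__(self, mask, view_id):
        self.points = set(mask)              # merged 3D point indices
        self.views = {view_id: set(mask)}    # per-view child masks

    def merge(self, mask, view_id):
        self.points |= set(mask)
        self.views.setdefault(view_id, set()).update(mask)

def group_masks(per_view_masks, merge_thresh=0.25):
    """Greedy cross-view grouping: each new mask joins the best-matching
    instance node if IoU exceeds merge_thresh, otherwise it spawns a new
    node (a rough stand-in for the paper's conditional replacement)."""
    nodes = []
    for view_id, masks in enumerate(per_view_masks):
        for mask in masks:
            best, best_iou = None, merge_thresh
            for node in nodes:
                score = iou(node.points, set(mask))
                if score > best_iou:
                    best, best_iou = node, score
            if best is not None:
                best.merge(mask, view_id)
            else:
                nodes.append(InstanceNode(mask, view_id))
    return nodes
```

Under this toy model, two masks of the same object seen from different views overlap in 3D and collapse into one node, while over-segmented fragments that never overlap stay separate until a later view links them.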

📝 Abstract
Reliable 3D instance segmentation is fundamental to language-grounded robotic manipulation. This is especially critical in cluttered environments, where occlusions, limited viewpoints, and noisy masks degrade perception. To address these challenges, we present Clutt3R-Seg, a zero-shot pipeline for robust 3D instance segmentation in language-grounded grasping under clutter. Our key idea is to introduce a hierarchical instance tree built from semantic cues. Unlike prior approaches that attempt to refine noisy masks, our method leverages them as informative cues: through cross-view grouping and conditional substitution, the tree suppresses over- and under-segmentation, yielding view-consistent masks and robust 3D instances. Each instance is enriched with open-vocabulary semantic embeddings, enabling accurate target selection from natural language instructions. To handle scene changes during multi-stage tasks, we further introduce a consistency-aware update that preserves instance correspondences from only a single post-interaction image, allowing efficient adaptation without rescanning. Clutt3R-Seg is evaluated on both synthetic and real-world datasets, and validated on a real robot. Across all settings, it consistently outperforms state-of-the-art baselines in cluttered and sparse-view scenarios. Even on the most challenging heavy-clutter sequences, Clutt3R-Seg achieves an AP@25 of 61.66, over 2.2x higher than baselines, and with only four input views it surpasses MaskClustering with eight views by more than 2x. The code is available at: https://github.com/jeonghonoh/clutt3r-seg.
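The language-guided target selection in the abstract amounts to matching per-instance open-vocabulary embeddings against an embedded text query. The sketch below assumes precomputed embedding vectors (in practice these would come from a CLIP-style encoder, which is not reproduced here); `select_target` is a hypothetical helper name.

```python
import numpy as np

def select_target(instance_embs, text_emb):
    """Pick the instance whose open-vocabulary embedding best matches
    the language query, by cosine similarity. Returns the winning
    instance index and the full similarity vector."""
    inst = np.asarray(instance_embs, dtype=float)
    txt = np.asarray(text_emb, dtype=float)
    # Normalize so the dot product equals cosine similarity.
    inst = inst / np.linalg.norm(inst, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt)
    sims = inst @ txt
    return int(np.argmax(sims)), sims
```

For example, with two instance embeddings and a query embedding close to the first, `select_target` returns index 0; the grasp planner would then act on that instance's 3D mask.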
Problem

Research questions and friction points this paper is trying to address.

3D instance segmentation
language-grounded grasping
cluttered scenes
sparse-view
occlusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

hierarchical instance tree
zero-shot 3D instance segmentation
language-grounded grasping
open-vocabulary semantic embedding
consistency-aware update
Jeongho Noh
Dept. of Mechanical Engineering, SNU, Seoul, S. Korea
Tai Hyoung Rhee
Dept. of Mechanical Engineering, SNU, Seoul, S. Korea
Eunho Lee
Interdisciplinary Program in Artificial Intelligence, SNU, Seoul, S. Korea
Jeongyun Kim
Ph.D. candidate, Seoul National University
SLAM, computer vision
Sunwoo Lee
Robotics Lab, Hyundai Motor Company, S. Korea
Ayoung Kim
Seoul National University
SLAM, underwater robots, navigation, mapping, computer vision