🤖 AI Summary
Existing category-agnostic 3D instance segmentation methods generalize poorly because real-world annotations are scarce and 2D priors are noisy; meanwhile, mainstream 3D synthetic data fail to simultaneously ensure geometric diversity, contextual complexity, and layout plausibility. This paper introduces the first data synthesis framework explicitly designed for category-agnostic 3D instance segmentation. Leveraging a heterogeneous CAD asset library, it integrates large language model (LLM)-driven spatial layout reasoning, depth-first-search-based layout optimization, and multi-view RGB-D rendering with point cloud fusion to generate high-fidelity, diverse, and semantically plausible synthetic scenes. On ScanNetV2, ScanNet++, and S3DIS, the synthesized data significantly boosts the zero-shot generalization of state-of-the-art models, including Mask3D and OpenScene, outperforming prior synthetic-data approaches. The results empirically validate the critical role of structured, layout-aware synthetic data in advancing category-agnostic 3D instance segmentation.
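The summary mentions depth-first-search-based layout optimization. The sketch below illustrates, under assumptions, how such a backtracking placement search could be structured; the function names, the candidate-pose generator, and the collision test are hypothetical illustrations and not the paper's actual implementation.

```python
def dfs_place(objects, placed, candidates, collides):
    """Depth-first search for a collision-free scene layout (illustrative sketch).

    objects    -- remaining objects to place (e.g. sampled CAD assets)
    placed     -- list of (obj, pose) pairs already fixed in the scene
    candidates -- fn(obj, placed) -> candidate poses, e.g. derived from
                  LLM-suggested spatial relations ("chair next to desk")
    collides   -- fn(obj, pose, placed) -> True if the placement overlaps
    """
    if not objects:                          # every object placed: success
        return placed
    obj, rest = objects[0], objects[1:]
    for pose in candidates(obj, placed):     # try candidate poses in order
        if collides(obj, pose, placed):
            continue                         # skip an overlapping placement
        layout = dfs_place(rest, placed + [(obj, pose)], candidates, collides)
        if layout is not None:               # a full consistent layout was found
            return layout
    return None                              # backtrack: no valid pose for obj
```

In a full pipeline, a call such as `dfs_place(sampled_objects, [], candidates, collides)` would either return a complete placement or `None`, in which case the objects or constraints could be resampled.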
📝 Abstract
Class-agnostic 3D instance segmentation tackles the challenging task of segmenting all object instances, including previously unseen ones, without relying on semantic classes. Current methods struggle to generalize due to scarce annotated 3D scene data or noisy 2D segmentations. While synthetic data generation offers a promising solution, existing 3D scene synthesis methods fail to simultaneously satisfy geometric diversity, contextual complexity, and layout plausibility, each essential for this task. To address these needs, we propose an Adapted 3D Scene Synthesis pipeline for class-agnostic 3D Instance SegmenTation, termed ASSIST-3D, which synthesizes data tailored to enhancing model generalization. Specifically, ASSIST-3D features three key innovations: 1) Heterogeneous Object Selection from extensive 3D CAD asset collections, incorporating randomness in object sampling to maximize geometric and contextual diversity; 2) Scene Layout Generation through LLM-guided spatial reasoning combined with depth-first search for plausible object placements; and 3) Realistic Point Cloud Construction via multi-view RGB-D image rendering and fusion from the synthetic scenes, closely mimicking real-world sensor data acquisition. Experiments on the ScanNetV2, ScanNet++, and S3DIS benchmarks demonstrate that models trained with ASSIST-3D-generated data significantly outperform existing methods. Further comparisons underscore the superiority of our purpose-built pipeline over existing 3D scene synthesis approaches.
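The third step fuses multi-view RGB-D renderings into a point cloud. Below is a minimal sketch of depth back-projection and voxel-based fusion, assuming a standard pinhole camera model with known intrinsics and camera-to-world poses; the function names and the voxel size are hypothetical and not taken from the paper.

```python
import numpy as np

def backproject(depth, K, cam_to_world):
    """Back-project a depth map (H, W) into world-space 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0                                    # drop invalid/missing depth
    x = (u.reshape(-1) - K[0, 2]) * z / K[0, 0]      # pinhole model, camera frame
    y = (v.reshape(-1) - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=0)[:, valid]
    return (cam_to_world @ pts_cam)[:3].T            # (N, 3) world-space points

def fuse_views(depths, Ks, poses, voxel=0.02):
    """Fuse per-view point clouds and voxel-downsample to mimic a real scan."""
    pts = np.concatenate([backproject(d, K, T)
                          for d, K, T in zip(depths, Ks, poses)], axis=0)
    keys = np.floor(pts / voxel).astype(np.int64)    # integer voxel-grid keys
    _, idx = np.unique(keys, axis=0, return_index=True)
    return pts[idx]                                  # keep one point per voxel
```

Rendering the synthetic scene from several viewpoints, back-projecting each depth map, and fusing the result in this way yields a point cloud whose density and occlusion patterns resemble those produced by real RGB-D sensor scans.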