🤖 AI Summary
To address low 3D segmentation accuracy and cross-view semantic inconsistency caused by severe occlusion and large scale variations in industrial point clouds, this paper proposes a top-down, two-stage hierarchical image-guided segmentation framework. In the first stage, instance-level coarse segmentation is achieved via multi-view rendering coupled with YOLO-World and SAM. In the second stage, fine-grained refinement is performed through point cloud back-projection and part-level Bayesian fusion. A novel multi-view Bayesian update mechanism is introduced to significantly improve cross-view consistency and boundary precision. Crucially, the method requires no dense 3D annotations; only inexpensive 2D image supervision is needed. Evaluated on a real-world factory dataset, it achieves consistent mIoU improvements across all categories. Strong generalization and robustness are further validated on public benchmarks.
📝 Abstract
Reliable 3D segmentation is critical for understanding complex scenes with dense layouts and multi-scale objects, as commonly seen in industrial environments. In such scenarios, heavy occlusion weakens geometric boundaries between objects, and large differences in object scale cause end-to-end models to fail to capture both coarse and fine details accurately. Existing 3D point-based methods require costly annotations, while image-guided methods often suffer from semantic inconsistencies across views. To address these challenges, we propose a hierarchical image-guided 3D segmentation framework that progressively refines segmentation from instance level to part level. Instance segmentation involves rendering a top-view image and projecting SAM-generated masks, prompted by YOLO-World, back onto the 3D point cloud. Part-level segmentation is subsequently performed by rendering multi-view images of each instance obtained from the previous stage and applying the same 2D segmentation and back-projection process at each view, followed by Bayesian updating fusion to ensure semantic consistency across views. Experiments on real-world factory data demonstrate that our method effectively handles occlusion and structural complexity, achieving consistently high per-class mIoU scores. Additional evaluations on a public dataset confirm the generalization ability of our framework, highlighting its robustness, annotation efficiency, and adaptability to diverse 3D environments.
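The multi-view Bayesian updating fusion described above can be illustrated as a per-point sequential update of class probabilities: starting from a uniform prior, each view's 2D prediction (back-projected onto the points) multiplies into the posterior, and the fused label is the posterior argmax. This is a minimal sketch under the assumption that each view supplies a per-point class distribution; `bayesian_fuse` is a hypothetical helper, not the authors' implementation.

```python
import numpy as np

def bayesian_fuse(view_probs, num_classes, eps=1e-6):
    """Fuse per-view class probabilities for each 3D point.

    view_probs: list of (N, C) arrays, one per rendered view, where row i
    is the class distribution assigned to point i in that view (a uniform
    row can stand in for points occluded or out of frame).
    Returns an (N,) array of fused class labels.
    """
    n_points = view_probs[0].shape[0]
    # Uniform prior over classes, kept in log space for numerical stability.
    log_post = np.full((n_points, num_classes), -np.log(num_classes))
    for probs in view_probs:
        # Bayesian update: multiply likelihoods, i.e. add log-probabilities.
        # eps guards against log(0) for classes a view rules out entirely.
        log_post += np.log(probs + eps)
        # Renormalize so the posterior remains a valid distribution.
        log_post -= np.log(np.exp(log_post).sum(axis=1, keepdims=True))
    return log_post.argmax(axis=1)
```

Because the update is order-independent (log-likelihoods simply accumulate), views can be fused incrementally as they are rendered, and a confidently wrong single view is outvoted by consistent evidence from the others.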