🤖 AI Summary
In 3D point cloud instance segmentation, conventional Transformer-based methods suffer from suboptimal query initialization, struggling to jointly encode spatial and semantic information, and objects captured in shallow decoder layers often disappear in deeper ones, degrading recall. To address these issues, we propose (i) an Agent-Interpolation initialization module that jointly optimizes foreground coverage and semantic awareness via interpolation-based modeling; (ii) a hierarchical query fusion decoder that explicitly preserves low-overlap queries to mitigate object loss; and (iii) layer-wise supervision to enhance training stability. Our approach achieves state-of-the-art performance on ScanNetV2, ScanNet200, ScanNet++, and S3DIS, improving both instance recall and segmentation accuracy, especially for small objects and heavily occluded scenes.
📝 Abstract
3D instance segmentation aims to predict a set of object instances in a scene and represent them as binary foreground masks with corresponding semantic labels. Transformer-based methods are gaining increasing attention due to their elegant pipelines, reduced manual selection of geometric properties, and superior performance. However, these methods fail to maintain strong position and content information simultaneously during query initialization. Additionally, because supervision is applied at each decoder layer, objects captured in earlier layers can disappear as the layers deepen. To overcome these hurdles, we introduce Beyond the Final Layer: Hierarchical Query Fusion Transformer with Agent-Interpolation Initialization for 3D Instance Segmentation (BFL). Specifically, an Agent-Interpolation Initialization Module is designed to generate resilient queries that balance foreground coverage and content learning. In addition, a Hierarchical Query Fusion Decoder is designed to retain low-overlap queries, mitigating the drop in recall as layers deepen. Extensive experiments on the ScanNetV2, ScanNet200, ScanNet++, and S3DIS datasets demonstrate the superior performance of BFL.
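The abstract does not spell out how low-overlap queries are retained, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each decoder layer yields one binary mask per query, that "overlap" means mask IoU between a previous-layer query and the current layer's queries, and that a fixed number `k` of the least-overlapping previous-layer queries is carried forward. The names `mask_iou`, `retain_low_overlap`, and `k` are hypothetical.

```python
import numpy as np


def mask_iou(a, b):
    """Pairwise IoU between two sets of binary masks: a is (Qa, N), b is (Qb, N)."""
    a = a.astype(bool)
    b = b.astype(bool)
    inter = (a[:, None, :] & b[None, :, :]).sum(-1)
    union = (a[:, None, :] | b[None, :, :]).sum(-1)
    # Guard against empty masks to avoid division by zero.
    return inter / np.maximum(union, 1)


def retain_low_overlap(prev_masks, curr_masks, k):
    """Indices of the k previous-layer queries least covered by the current layer.

    Assumption (not from the source): a query is "low overlap" when its best
    IoU against any current-layer mask is small.
    """
    best = mask_iou(prev_masks, curr_masks).max(axis=1)
    return np.argsort(best, kind="stable")[:k]


# Toy scene with 6 points: three previous-layer masks, two current-layer masks.
prev_masks = np.array([[1, 1, 0, 0, 0, 0],
                       [0, 0, 1, 1, 0, 0],
                       [0, 0, 0, 0, 1, 1]])
curr_masks = np.array([[1, 1, 0, 0, 0, 0],
                       [0, 0, 1, 1, 1, 0]])

kept = retain_low_overlap(prev_masks, curr_masks, k=1)
# Fuse: current-layer masks plus the retained previous-layer ones.
fused = np.concatenate([curr_masks, prev_masks[kept]], axis=0)
```

In this toy case the third previous-layer mask is only weakly covered by the current layer, so it is the one carried forward. This illustrates how retaining such queries could keep an object from vanishing in deeper layers.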