🤖 AI Summary
To address performance instability in few-shot 3D point cloud semantic segmentation caused by random prototype initialization, this paper proposes the Whitening Aggregation and Recovery Module (WARM). WARM explicitly aligns the distributions of support-set features and prototype tokens via feature whitening and coloring transformations before and after cross-attention, thereby enhancing semantic consistency and generalization. Unlike conventional methods relying on hand-crafted strategies such as farthest point sampling, WARM establishes a more robust and learnable prototype generation framework. Evaluated on multiple few-shot 3D segmentation benchmarks—including ScanNet and S3DIS—our method consistently surpasses existing state-of-the-art approaches, achieving average mIoU gains of 3.2–5.7 percentage points. Moreover, training and inference stability is significantly improved, empirically validating the effectiveness of distribution-aligned prototype learning.
📝 Abstract
Few-Shot 3D Point Cloud Segmentation (FS-PCS) aims to predict per-point labels for an unlabeled point cloud, given only a few labeled examples. To extract discriminative representations from the limited support set, existing methods have constructed prototypes using conventional algorithms such as farthest point sampling. However, we point out that its initial randomness significantly affects FS-PCS performance and that the prototype generation process remains underexplored despite its prevalence. This motivates us to investigate an advanced prototype generation method based on attention mechanism. Despite its potential, we found that vanilla module suffers from the distributional gap between learnable prototypical tokens and support features. To overcome this, we propose White Aggregation and Restoration Module (WARM), which resolves the misalignment by sandwiching cross-attention between whitening and coloring transformations. Specifically, whitening aligns the support features to prototypical tokens before attention process, and subsequently coloring restores the original distribution to the attended tokens. This simple yet effective design enables robust attention, thereby generating representative prototypes by capturing the semantic relationships among support features. Our method achieves state-of-the-art performance with a significant margin on multiple FS-PCS benchmarks, demonstrating its effectiveness through extensive experiments.