🤖 AI Summary
To address the limited generalization of unimodal point cloud models and their poor adaptability to novel classes in few-shot 3D point cloud semantic segmentation (FS-PCS), this paper introduces, for the first time, a multimodal few-shot 3D segmentation framework that incorporates readily available textual labels and the 2D image modality. Methodologically, the authors propose a Multimodal Correlation Fusion (MCF) module and a Multimodal Semantic Fusion (MSF) module to achieve cross-modal feature alignment and complementarity, and additionally introduce a Test-time Adaptive Cross-modal Calibration (TACC) mechanism to mitigate training bias and dynamically refine predictions for novel classes. The architecture employs a shared backbone with two visual heads, extracting intermodal and unimodal features, alongside a pretrained text encoder that produces label embeddings. Extensive experiments on S3DIS and ScanNet demonstrate that the approach significantly outperforms unimodal baselines, validating the substantial performance gains conferred by multimodal information in FS-PCS.
📝 Abstract
Few-shot 3D point cloud segmentation (FS-PCS) aims to generalize models to segment novel categories with minimal annotated support samples. While existing FS-PCS methods have shown promise, they primarily focus on unimodal point cloud inputs, overlooking the potential benefits of leveraging multimodal information. In this paper, we address this gap by introducing a multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality. Under this easy-to-achieve setup, we present the MultiModal Few-Shot SegNet (MM-FSS), a model effectively harnessing complementary information from multiple modalities. MM-FSS employs a shared backbone with two heads to extract intermodal and unimodal visual features, and a pretrained text encoder to generate text embeddings. To fully exploit the multimodal information, we propose a Multimodal Correlation Fusion (MCF) module to generate multimodal correlations, and a Multimodal Semantic Fusion (MSF) module to refine the correlations using text-aware semantic guidance. Additionally, we propose a simple yet effective Test-time Adaptive Cross-modal Calibration (TACC) technique to mitigate training bias, further improving generalization. Experimental results on the S3DIS and ScanNet datasets demonstrate significant performance improvements achieved by our method. The efficacy of our approach indicates the benefits of leveraging commonly ignored free modalities for FS-PCS, providing valuable insights for future research. The code is available at https://github.com/ZhaochongAn/Multimodality-3D-Few-Shot
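The pipeline the abstract describes (two visual heads, label-text embeddings, MCF correlation fusion, MSF text-aware refinement, and TACC test-time calibration) can be caricatured in a few lines of NumPy. Every dimension, fusion weight, gate, and the calibration rule below is an illustrative placeholder, not the paper's actual implementation; see the linked repository for the real model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): N query points,
# K novel classes, D feature channels.
N, K, D = 6, 2, 8

def l2norm(x, axis=-1):
    # Normalize rows so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

# Intermodal and unimodal visual features for the query point cloud,
# standing in for the outputs of the shared backbone's two heads.
feat_inter = l2norm(rng.normal(size=(N, D)))
feat_uni = l2norm(rng.normal(size=(N, D)))

# Per-class prototypes aggregated from the annotated support samples.
proto_inter = l2norm(rng.normal(size=(K, D)))
proto_uni = l2norm(rng.normal(size=(K, D)))

# Embeddings of the K class labels from a pretrained text encoder.
text_emb = l2norm(rng.normal(size=(K, D)))

# --- MCF (sketch): fuse per-modality correlation volumes. ---
corr_inter = feat_inter @ proto_inter.T      # (N, K) cosine correlations
corr_uni = feat_uni @ proto_uni.T            # (N, K)
corr_fused = 0.5 * (corr_inter + corr_uni)   # placeholder fusion weights

# --- MSF (sketch): refine correlations with text-aware guidance. ---
sem_score = feat_inter @ text_emb.T          # (N, K) text-visual affinity
gate = 1.0 / (1.0 + np.exp(-sem_score))      # illustrative sigmoid gate
corr_refined = corr_fused * gate

# --- TACC (sketch): at test time, blend the model's correlations with
# the text-based scores to calibrate predictions on unseen classes.
gamma = 0.3                                  # illustrative blend weight
logits = (1 - gamma) * corr_refined + gamma * sem_score
pred = logits.argmax(axis=1)                 # per-point class prediction
```

The point of the sketch is the data flow: both visual heads contribute a correlation volume, the text modality both refines those correlations and supplies an independent score that TACC can fall back on when the trained correlations are biased toward base classes.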