🤖 AI Summary
Open-world 3D scene understanding faces two key challenges: the scarcity of annotated point clouds and the difficulty of aligning semantics across heterogeneous modalities. To address these, we propose an image-bridged, region-level vision-language supervision paradigm that leverages images as intermediaries to enable point cloud–text alignment, eliminating the need for manually curated point cloud–text pairs. Our method introduces a logit/feature distillation mechanism and an explicit vision–point cloud matching module to mitigate the misalignment caused by imperfect 2D–3D projection. Additionally, we design a two-stage collaborative training strategy coupled with a unified four-task joint loss. On both Base-Annotated and Annotation-Free 3D semantic segmentation benchmarks, our approach achieves new state-of-the-art performance, outperforming prior methods by 15.6% and 14.8% in mean IoU, respectively, demonstrating substantially improved generalization under extreme label scarcity.
📝 Abstract
We present UniPLV, a powerful framework that unifies point clouds, images, and text in a single learning paradigm for open-world 3D scene understanding. UniPLV employs the image modality as a bridge to co-embed 3D points with pre-aligned images and text in a shared feature space, without requiring carefully crafted point cloud–text pairs. To accomplish multi-modal alignment, we propose two key strategies: (i) logit and feature distillation modules between images and point clouds, and (ii) a vision-point matching module that explicitly corrects the misalignment caused by point-to-pixel projection. To further improve the performance of our unified framework, we adopt four task-specific losses and a two-stage training strategy. Extensive experiments show that our method outperforms the state-of-the-art methods by an average of 15.6% and 14.8% for semantic segmentation over the Base-Annotated and Annotation-Free tasks, respectively. The code will be released later.
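To make the two distillation strategies concrete, here is a minimal sketch of what logit and feature distillation between paired image pixels and projected 3D points could look like. This is an illustrative reconstruction, not the paper's released code: the function names, the KL-divergence form of the logit loss, and the L2 form of the feature loss are all assumptions about one common way such distillation objectives are implemented.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the class dimension.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def logit_distillation_loss(point_logits, pixel_logits, tau=1.0):
    """Hypothetical logit distillation: KL divergence from the image branch's
    (teacher) class distribution to the point branch's (student) distribution,
    computed per projected point-pixel pair and averaged."""
    p = softmax(pixel_logits / tau)   # teacher: image-branch class probabilities
    q = softmax(point_logits / tau)   # student: point-branch class probabilities
    eps = 1e-9                        # avoid log(0)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)))

def feature_distillation_loss(point_feats, pixel_feats):
    """Hypothetical feature distillation: mean squared L2 distance between a
    point's embedding and the embedding of the pixel it projects onto."""
    return float(np.mean(np.sum((point_feats - pixel_feats) ** 2, axis=-1)))
```

Under this sketch, both losses vanish when the point branch exactly reproduces the image branch's predictions and features for every valid point-pixel pair; in practice a matching module (strategy (ii) in the abstract) would gate which pairs contribute, since projection errors make some pairs unreliable.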