Exploring Modality Guidance to Enhance VFM-based Feature Fusion for UDA in 3D Semantic Segmentation

📅 2025-04-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses unsupervised domain adaptation (UDA) for LiDAR point cloud 3D semantic segmentation. Methodologically, it introduces a dynamic multimodal feature fusion framework that leverages vision foundation models (VFMs) to extract robust, cross-domain 2D image and 3D point cloud features, establishing paired cross-modal representations. A modality-guided dynamic gating mechanism adaptively weights and fuses the dual-stream features based on target-domain characteristics, while the 3D backbone is jointly optimized on labeled source-domain and unlabeled target-domain data. According to the authors, this is the first work to systematically incorporate VFM-derived cross-modal priors into 3D UDA segmentation, significantly enhancing domain-transfer robustness. Extensive experiments demonstrate consistent improvements across multiple benchmarks, with an average mIoU gain of +6.5 percentage points over the previous state of the art.
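The gating mechanism described above can be illustrated with a minimal sketch. The paper does not publish this exact formulation; the function below is a common way to realize modality-guided gating, where per-point weights for the 2D and 3D streams are predicted from the concatenated features and normalized with a softmax. All names (`gated_fusion`, `w_gate`, `b_gate`) are illustrative assumptions, not the authors' API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_fusion(feat_2d, feat_3d, w_gate, b_gate):
    """Hypothetical modality-guided gating over paired 2D/3D features.

    feat_2d, feat_3d : (N, C) per-point features from the image and
                       point cloud streams (already aligned/projected).
    w_gate, b_gate   : (2C, 2) weights and (2,) bias of a learned
                       linear gate (stand-ins for trained parameters).
    Returns the fused (N, C) features and the (N, 2) gate weights.
    """
    joint = np.concatenate([feat_2d, feat_3d], axis=-1)  # (N, 2C)
    gates = softmax(joint @ w_gate + b_gate, axis=-1)    # (N, 2), rows sum to 1
    # Convex combination of the two streams, guided by both modalities.
    fused = gates[:, :1] * feat_2d + gates[:, 1:] * feat_3d
    return fused, gates

# Toy usage with random features standing in for VFM / 3D-backbone outputs.
rng = np.random.default_rng(0)
f2d = rng.normal(size=(4, 3))
f3d = rng.normal(size=(4, 3))
w = rng.normal(size=(6, 2))
b = np.zeros(2)
fused, gates = gated_fusion(f2d, f3d, w, b)
```

Because the gate is input-dependent, the relative contribution of each modality can shift per point and per target domain, which is the behavior the summary attributes to the method.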

📝 Abstract
Vision Foundation Models (VFMs) have become a de facto choice for many downstream vision tasks, like image classification, image segmentation, and object localization. However, they can also provide significant utility for downstream 3D tasks that can leverage the cross-modal information (e.g., from paired image data). In our work, we further explore the utility of VFMs for adapting from a labeled source to unlabeled target data for the task of LiDAR-based 3D semantic segmentation. Our method consumes paired 2D-3D (image and point cloud) data and relies on the robust (cross-domain) features from a VFM to train a 3D backbone on a mix of labeled source and unlabeled target data. At the heart of our method lies a fusion network that is guided by both the image and point cloud streams, with their relative contributions adjusted based on the target domain. We extensively compare our proposed methodology with different state-of-the-art methods in several settings and achieve strong performance gains, for example an average improvement of 6.5 mIoU (over all tasks) compared with the previous state of the art.
Problem

Research questions and friction points this paper is trying to address.

Enhancing VFM-based feature fusion for 3D semantic segmentation UDA
Leveraging cross-modal (2D-3D) data to improve LiDAR segmentation
Adapting labeled source to unlabeled target data via modality guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

VFM-based cross-modal feature fusion
Modality-guided 2D-3D data integration
Domain-adaptive LiDAR segmentation training