UniD-Shift: Towards Unified Semantic Segmentation via Interpretable Share-Private Multimodal Decomposition

📅 2026-05-08

🤖 AI Summary

This work addresses cross-modal alignment in multimodal semantic segmentation, which is hindered by the sparsity of LiDAR point clouds and the viewpoint dependency of images. The authors propose a unified 2D-3D semantic segmentation framework that explicitly decomposes the features of each modality into shared and modality-specific (private) subspaces. A SAM-based visual encoder and an SPTNet-based geometric encoder supply complementary semantic and geometric representations, and a lightweight attention-based fusion module aggregates the shared features into a consistent cross-modal representation. The interpretable shared-private decomposition preserves modality-specific characteristics while strengthening semantic alignment, and a regularized training objective enforces both semantic alignment and subspace independence. On the SemanticKITTI and nuScenes benchmarks, the method improves segmentation accuracy over representative multimodal baselines with competitive computational efficiency, and cross-domain evaluation on nuScenes USA-Singapore shows stable performance under distribution shift.

📝 Abstract

Semantic segmentation of large-scale 3D point clouds is crucial for applications such as autonomous driving and urban digital twins. However, the sparse sampling pattern of LiDAR and the view-dependent geometric distortion in image observations complicate cross-modal alignment and hinder stable fusion. Inspired by the fact that 2D images captured by cameras are representations of the 3D world, we recognize that the features learned from 2D and 3D segmentation share some common semantics, while other aspects remain modality-specific. This insight motivates a unified multimodal framework for joint 2D-3D semantic segmentation. We combine a SAM-based vision encoder with an SPTNet-based geometric encoder to extract complementary semantic and geometric representations. The resulting features from both modalities are explicitly decomposed into shared and private subspaces, where the shared components summarize semantic factors common to both domains, and the private components preserve properties that are unique to each modality. A lightweight attention-based fusion module aggregates the shared features into a consistent cross-modal representation, and a regularized training objective ensures both semantic alignment and subspace independence. Experiments on the SemanticKITTI and nuScenes benchmarks demonstrate consistent improvements in segmentation accuracy over representative multimodal baselines, accompanied by competitive computational efficiency. Cross-domain evaluation on nuScenes USA-Singapore shows stable performance under distribution shifts, demonstrating strong generalization. The implementation code is publicly available at: https://github.com/shuaizhang69/UniD-Shift.
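
To make the shared-private idea concrete, the following is a minimal, dependency-free Python sketch. It is not the authors' implementation, and the abstract does not specify the exact form of any of these pieces. The sketch assumes linear projection heads, a mean-squared alignment term between paired shared features, a squared-Frobenius orthogonality penalty as the "subspace independence" term, and a per-point softmax over the two modalities as the "lightweight attention" fusion. All names (`decompose`, `alignment_loss`, `independence_loss`, `fuse`, `regularizer`, `lam`, `query`) are hypothetical, and it further assumes image features have already been sampled at the pixels that the LiDAR points project to, so that rows of the two feature matrices correspond.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def random_matrix(rows, cols, rng):
    """Small random matrix; stands in for learned weights or encoder outputs."""
    scale = 1.0 / math.sqrt(rows)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def decompose(features, w_shared, w_private):
    """Split N x D features into shared and private N x K parts (linear heads assumed)."""
    return matmul(features, w_shared), matmul(features, w_private)


def alignment_loss(shared_a, shared_b):
    """Mean (over points) squared distance between paired shared features."""
    total = sum((x - y) ** 2
                for row_a, row_b in zip(shared_a, shared_b)
                for x, y in zip(row_a, row_b))
    return total / len(shared_a)


def independence_loss(shared, private):
    """Squared Frobenius norm of shared^T @ private; zero when the two are orthogonal."""
    cross = matmul(transpose(shared), private)
    return sum(v * v for row in cross for v in row)


def fuse(shared_a, shared_b, query):
    """Per-point softmax attention over the two modalities' shared features."""
    fused = []
    for row_a, row_b in zip(shared_a, shared_b):
        scores = [sum(x * q for x, q in zip(row, query)) for row in (row_a, row_b)]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        w_a, w_b = (e / sum(exps) for e in exps)
        fused.append([w_a * x + w_b * y for x, y in zip(row_a, row_b)])
    return fused


def regularizer(img_feats, pts_feats, weights, lam=0.1):
    """Alignment term plus lam-weighted independence terms (lam is a hypothetical weight)."""
    img_s, img_p = decompose(img_feats, weights["img_s"], weights["img_p"])
    pts_s, pts_p = decompose(pts_feats, weights["pts_s"], weights["pts_p"])
    indep = independence_loss(img_s, img_p) + independence_loss(pts_s, pts_p)
    return alignment_loss(img_s, pts_s) + lam * indep


rng = random.Random(0)
n_points, dim, k = 5, 8, 4
img_feats = random_matrix(n_points, dim, rng)  # image features sampled at projected points
pts_feats = random_matrix(n_points, dim, rng)  # point-cloud features for the same points
weights = {name: random_matrix(dim, k, rng)
           for name in ("img_s", "img_p", "pts_s", "pts_p")}
loss = regularizer(img_feats, pts_feats, weights)
```

In a full training setup, a term like `regularizer` would be added to the segmentation loss, and the output of `fuse` would feed the segmentation head, while the private features stay within their own modality branch. The repository linked above is the place to check how the authors actually define these terms.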

Problem

Research questions and friction points this paper is trying to address.

semantic segmentation

multimodal fusion

cross-modal alignment

3D point clouds

domain shift

Innovation

Methods, ideas, or system contributions that make the work stand out.

share-private decomposition

multimodal fusion

semantic segmentation