Label-Efficient LiDAR Semantic Segmentation with 2D-3D Vision Transformer Adapters

📅 2025-03-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address three key challenges in LiDAR semantic segmentation (scarce pretraining data, limited transfer of knowledge from vision models, and poor generalization caused by rigid point-cloud-specific architectures), this paper proposes BALViT. BALViT introduces a dual-branch LiDAR encoding framework: range-view representations are processed by a frozen Vision Transformer (ViT) acting as a universal feature encoder, while a parallel bird's-eye-view branch enhances these features through multiple cross-attention interactions. A 2D-3D Vision Transformer adapter connects the two branches, continuously injecting domain-dependent knowledge into the frozen vision backbone and yielding a strong, label-efficient LiDAR encoder. Extensive experiments on the SemanticKITTI and nuScenes benchmarks show that BALViT outperforms state-of-the-art methods in small data regimes. The code and pretrained models are publicly released.

📝 Abstract
LiDAR semantic segmentation models are typically trained from random initialization as universal pre-training is hindered by the lack of large, diverse datasets. Moreover, most point cloud segmentation architectures incorporate custom network layers, limiting the transferability of advances from vision-based architectures. Inspired by recent advances in universal foundation models, we propose BALViT, a novel approach that leverages frozen vision models as amodal feature encoders for learning strong LiDAR encoders. Specifically, BALViT incorporates both range-view and bird's-eye-view LiDAR encoding mechanisms, which we combine through a novel 2D-3D adapter. While the range-view features are processed through a frozen image backbone, our bird's-eye-view branch enhances them through multiple cross-attention interactions. Thereby, we continuously improve the vision network with domain-dependent knowledge, resulting in a strong label-efficient LiDAR encoding mechanism. Extensive evaluations of BALViT on the SemanticKITTI and nuScenes benchmarks demonstrate that it outperforms state-of-the-art methods on small data regimes. We make the code and models publicly available at: http://balvit.cs.uni-freiburg.de.
Problem

Research questions and friction points this paper is trying to address.

Lack of large datasets for LiDAR semantic segmentation pre-training.
Limited transferability of vision-based architectures to point cloud segmentation.
Need for label-efficient LiDAR encoding mechanisms.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses frozen vision models as feature encoders.
Combines range-view and bird's-eye-view LiDAR encoding.
Enhances features with cross-attention interactions.