3D Learnable Supertoken Transformer for LiDAR Point Cloud Scene Segmentation

📅 2024-05-23
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low efficiency and weak semantic modeling of large-scale LiDAR point cloud scene segmentation, this paper proposes 3DLST, an efficient 3D Transformer framework. Methodologically, it introduces three key innovations: (1) the first Dynamic Supertoken Optimization (DSO) block, in which learnable supertokens enable semantic-aware adaptive token clustering and reconstruction; (2) a Cross-Attention-guided Upsampling (CAU) block that improves fine-grained feature recovery from the optimized supertokens; and (3) a Transformer-oriented W-net architecture that replaces the conventional U-net design for stronger multi-scale representation learning. Evaluated on the MS-LiDAR, DALES, and Toronto-3D benchmarks, 3DLST achieves state-of-the-art performance (89.3% average F1 score on MS-LiDAR; 80.2% and 80.4% mIoU on DALES and Toronto-3D, respectively) while running up to 5x faster at inference than previous best-performing methods.
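The learnable-supertoken idea in (1) can be illustrated with a minimal numpy sketch. This is not the authors' implementation: it only shows the general pattern of soft-assigning point tokens to a small set of supertokens by feature similarity and refreshing the supertokens as weighted means; all names and the temperature `tau` are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def supertoken_cluster(tokens, supertokens, tau=0.5):
    """One soft clustering/update step (illustrative, not the paper's DSO):
    assign each point token to supertokens by feature similarity, then
    refresh each supertoken as the similarity-weighted mean of its tokens."""
    sim = tokens @ supertokens.T                 # (N, M) token-supertoken similarity
    assign = softmax(sim / tau, axis=1)          # soft assignment weights per token
    # weighted mean of tokens per supertoken, normalized by total weight
    updated = (assign.T @ tokens) / (assign.sum(axis=0)[:, None] + 1e-8)
    return updated, assign

rng = np.random.default_rng(0)
tokens = rng.normal(size=(128, 32))      # 128 point tokens, 32-dim features
supertokens = rng.normal(size=(16, 32))  # 16 learnable supertokens
supertokens, assign = supertoken_cluster(tokens, supertokens)
```

In the paper this optimization is driven by multi-level deep features during training, so the supertokens adapt to semantic homogeneity rather than to fixed geometric pre-processing.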

📝 Abstract
3D Transformers have achieved great success in point cloud understanding and representation. However, there is still considerable scope for further development of effective and efficient Transformers for large-scale LiDAR point cloud scene segmentation. This paper proposes a novel 3D Transformer framework, named the 3D Learnable Supertoken Transformer (3DLST). The key contributions are summarized as follows. Firstly, we introduce the first Dynamic Supertoken Optimization (DSO) block for efficient token clustering and aggregation, where the learnable supertoken definition avoids the time-consuming pre-processing of traditional superpoint generation. Since the learnable supertokens are dynamically optimized by multi-level deep features during network learning, they are tailored to semantic homogeneity-aware token clustering. Secondly, an efficient Cross-Attention-guided Upsampling (CAU) block is proposed for token reconstruction from the optimized supertokens. Thirdly, 3DLST is equipped with a novel W-net architecture instead of the common U-net design, which is more suitable for Transformer-based feature learning. State-of-the-art performance on three challenging LiDAR datasets, namely airborne MultiSpectral LiDAR (MS-LiDAR) (89.3% average F1 score), DALES (80.2% mIoU), and Toronto-3D (80.4% mIoU), demonstrates the superiority of 3DLST and its strong adaptability to various LiDAR point cloud data (airborne MS-LiDAR, aerial LiDAR, and vehicle-mounted LiDAR data). Furthermore, 3DLST also achieves satisfactory algorithm efficiency, running up to 5x faster than previous best-performing methods.
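The token-reconstruction step of the CAU block can be sketched in numpy: the dense point tokens act as queries and the coarse supertokens as keys/values, so each point recovers a similarity-weighted mix of supertoken features. This is a minimal sketch under assumed shapes, not the paper's exact block (which would also include learned projections and normalization layers).

```python
import numpy as np

def cross_attention_upsample(point_feats, supertokens):
    """Illustrative cross-attention upsampling: points query the
    supertokens, and each point's output is a softmax-weighted
    combination of supertoken features (shape (N, C))."""
    d = point_feats.shape[1]
    scores = point_feats @ supertokens.T / np.sqrt(d)  # (N, M) attention logits
    scores -= scores.max(axis=1, keepdims=True)        # stabilize the softmax
    w = np.exp(scores)
    w /= w.sum(axis=1, keepdims=True)                  # attention over supertokens
    return w @ supertokens                             # per-point upsampled features

rng = np.random.default_rng(1)
pts = rng.normal(size=(64, 32))   # 64 dense point tokens to reconstruct
st = rng.normal(size=(8, 32))     # 8 optimized supertokens
up = cross_attention_upsample(pts, st)
```

Because attention weights are computed per point, fine-grained detail is recovered from far fewer supertokens than points, which is where the efficiency gain over per-point decoding comes from.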
Problem

Research questions and friction points this paper is trying to address.

LiDAR point cloud
efficient analysis
scene segmentation
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D Learnable Supertoken Transformer (3DLST)
Dynamic Supertoken Optimization (DSO)
Cross-Attention-guided Upsampling (CAU)
Authors
Dening Lu (University of Waterloo, computer graphics), Jun Zhou, K. Gao, Linlin Xu, Jonathan Li