S-BEVLoc: BEV-based Self-supervised Framework for Large-scale LiDAR Global Localization

📅 2025-09-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high annotation cost of LiDAR-based global localization—stemming from its reliance on high-precision GPS/SLAM ground-truth poses—this paper proposes the first BEV-based self-supervised end-to-end learning framework. Given only a single BEV image and its corresponding geographic coordinate, the method automatically constructs triplets by leveraging known geographic distances between keypoint regions. A SoftCos loss is introduced to enhance local feature discriminability, while a CNN-NetVLAD architecture generates robust global descriptors. Crucially, no ground-truth pose supervision is required, significantly improving scalability and deployment efficiency. Evaluated on large-scale KITTI and NCLT benchmarks, the method achieves state-of-the-art performance in both place recognition and global localization tasks, with strong loop-closure detection robustness.

📝 Abstract
LiDAR-based global localization is an essential component of simultaneous localization and mapping (SLAM), where it supports loop closure and re-localization. Current approaches rely on ground-truth poses obtained from GPS or SLAM odometry to supervise network training. Despite the great success of these supervised approaches, acquiring high-precision ground-truth poses demands substantial cost and effort. In this work, we propose S-BEVLoc, a novel self-supervised framework based on bird's-eye view (BEV) for LiDAR global localization, which eliminates the need for ground-truth poses and is highly scalable. We construct training triplets from single BEV images by leveraging the known geographic distances between keypoint-centered BEV patches. A convolutional neural network (CNN) extracts local features, and NetVLAD aggregates them into global descriptors. Moreover, we introduce a SoftCos loss to enhance learning from the generated triplets. Experimental results on the large-scale KITTI and NCLT datasets show that S-BEVLoc achieves state-of-the-art performance in place recognition, loop closure, and global localization tasks, while offering scalability that supervised approaches attain only at extra annotation cost.
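The abstract's core self-supervision idea — labeling keypoint-centered BEV patches as positives or negatives of an anchor purely by their known geographic distance, with no ground-truth poses — can be sketched roughly as below. The radii, function name, and sampling strategy are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

def build_triplets(coords, pos_radius=5.0, neg_radius=50.0, rng=None):
    """Hypothetical triplet mining from patch geographic coordinates.

    coords: (N, 2) array of per-patch geographic coordinates in meters.
    Patches within pos_radius of an anchor are treated as positives,
    patches beyond neg_radius as negatives; the thresholds are assumed.
    """
    rng = rng or np.random.default_rng(0)
    coords = np.asarray(coords, dtype=float)
    # Pairwise geographic distances between all patches.
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    triplets = []
    for a in range(len(coords)):
        pos = np.flatnonzero((dists[a] > 0) & (dists[a] <= pos_radius))
        neg = np.flatnonzero(dists[a] >= neg_radius)
        if pos.size and neg.size:
            triplets.append((a, int(rng.choice(pos)), int(rng.choice(neg))))
    return triplets
```

The point is that the supervisory signal comes entirely from geometry that is already available at data-collection time, which is what removes the GPS/SLAM ground-truth-pose requirement.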
Problem

Research questions and friction points this paper is trying to address.

Eliminates need for costly ground-truth pose supervision in LiDAR localization
Addresses scalability limitations in supervised BEV-based localization approaches
Solves large-scale localization without GPS/SLAM odometry dependency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised BEV framework
Geographic distance triplet training
SoftCos loss for feature learning
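The exact form of the SoftCos loss is not reproduced on this page; as a rough stand-in for how a cosine-based triplet objective shapes the descriptor space, here is a standard cosine triplet margin loss. The margin value and function name are illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def cosine_triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic cosine triplet margin loss (placeholder, not SoftCos).

    Penalizes the anchor unless its cosine similarity to the positive
    exceeds its similarity to the negative by at least `margin`.
    """
    def cos(u, v):
        u, v = np.asarray(u, float), np.asarray(v, float)
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    return max(0.0, margin - cos(anchor, positive) + cos(anchor, negative))
```

Any such objective pulls geographically nearby patches together and pushes distant ones apart in descriptor space, which is the property the retrieval-based place recognition stage relies on.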
Chenghao Zhang
Renmin University of China
Natural Language Processing, Information Retrieval, Multimodal
Lun Luo
Zhejiang University
SLAM, Place Recognition
Si-Yuan Cao
Zhejiang University
image alignment, homography estimation, image fusion, place recognition
Xiaokai Bai
Ph.D. student, Zhejiang University
Multimodal Fusion, 3D object detection, 4D Radar Perception, autonomous driving
Yuncheng Jin
College of Information Engineering, China Jiliang University, Hangzhou 310018, China
Zhu Yu
College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China
Beinan Yu
College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China, and also with the Jinhua Institute of Zhejiang University, Jinhua 321299, China
Yisen Wang
Assistant Professor, Peking University
Machine Learning, Self-Supervised Learning, Large Language Models, Safety
Hui-Liang Shen
College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China, and also with the Jinhua Institute of Zhejiang University, Jinhua 321299, China