LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LiDAR-inertial-visual odometry (LIVO) is sensitive to extrinsic calibration, while 3D vision foundation models (e.g., VGGT) scale poorly to large scenes and lack metric scale. To address these limitations, this paper proposes LiDAR-VGGT, the first framework to tightly couple LIVO and VGGT for dense mapping. Methodologically, it introduces a two-stage coarse-to-fine fusion pipeline: a pre-fusion module initializes metric scale, while a post-fusion module performs joint cross-modal pose optimization with bounding-box constraints and explicit scale regularization to suppress inter-modal scale distortion. Evaluated on multiple large-scale datasets, LiDAR-VGGT generates globally consistent, high-accuracy, metric-scale dense colored point clouds, outperforming state-of-the-art VGGT and LIVO baselines. The framework and an accompanying open-source point cloud evaluation toolkit are released.

📝 Abstract
Reconstructing large-scale colored point clouds is an important task in robotics, supporting perception, navigation, and scene understanding. Despite advances in LiDAR-inertial-visual odometry (LIVO), its performance remains highly sensitive to extrinsic calibration. Meanwhile, 3D vision foundation models, such as VGGT, suffer from limited scalability in large environments and inherently lack metric scale. To overcome these limitations, we propose LiDAR-VGGT, a novel framework that tightly couples LiDAR-inertial odometry with the state-of-the-art VGGT model through a two-stage coarse-to-fine fusion pipeline: First, a pre-fusion module with robust initialization refinement efficiently estimates VGGT poses and point clouds with coarse metric scale within each session. Then, a post-fusion module refines the cross-modal 3D similarity transformation, using bounding-box-based regularization to reduce scale distortions caused by inconsistent FOVs between LiDAR and camera sensors. Extensive experiments across multiple datasets demonstrate that LiDAR-VGGT achieves dense, globally consistent colored point clouds and outperforms both VGGT-based methods and LIVO baselines. Our framework and the proposed colored point cloud evaluation toolkit will be released as open source.
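The abstract's "cross-modal 3D similarity transformation" maps the scale-free VGGT reconstruction onto the metric LiDAR frame via a scale, rotation, and translation. The paper's exact formulation is not given here; a minimal sketch of one standard closed-form solution for such a Sim(3) alignment (Umeyama's method, assuming known point correspondences) looks like this:

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form similarity transform (Umeyama): find scale s, rotation R,
    translation t such that dst ~= s * R @ src_i + t for corresponding
    (N, 3) point sets. Illustrative only, not the paper's exact pipeline."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    x, y = src - mu_src, dst - mu_dst
    cov = y.T @ x / len(src)                      # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (x ** 2).sum() / len(src)           # variance of source cloud
    s = np.trace(np.diag(D) @ S) / var_src        # recovered metric scale
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Demo on synthetic data: apply a known transform, then recover it.
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 3))                   # e.g. scale-free VGGT points
theta = 0.5
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
dst = 2.0 * src @ R_true.T + np.array([1.0, -2.0, 3.0])  # metric LiDAR frame
s, R, t = umeyama_sim3(src, dst)
```

In practice the correspondences would come from projecting LiDAR points into the camera frames tracked by VGGT; the recovered `s` is what gives the fused map metric scale.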
Problem

Research questions and friction points this paper is trying to address.

Overcoming sensitivity to extrinsic calibration in LiDAR visual odometry systems
Addressing limited scalability and metric scale absence in 3D vision models
Mitigating scale distortions from inconsistent sensor field-of-views
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal coarse-to-fine fusion pipeline
Robust initialization refinement for VGGT poses
Bounding-box-based regularization reduces scale distortions
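The bounding-box-based regularization above can be pictured as penalizing disagreement between the spatial extents of the two clouds after scaling. The paper's actual loss is not reproduced in this summary; the following is a hypothetical sketch comparing per-axis axis-aligned bounding-box extents of a LiDAR cloud against a VGGT cloud scaled by the estimated factor `s`:

```python
import numpy as np

def bbox_scale_penalty(lidar_pts, vggt_pts, s):
    """Illustrative regularizer (not the paper's exact loss): squared
    log-ratio of per-axis bounding-box extents between the LiDAR cloud
    and the VGGT cloud after applying the estimated metric scale s."""
    ext_lidar = lidar_pts.max(axis=0) - lidar_pts.min(axis=0)
    ext_vggt = s * (vggt_pts.max(axis=0) - vggt_pts.min(axis=0))
    ratio = np.log(ext_vggt / ext_lidar)   # 0 per axis when extents agree
    return float((ratio ** 2).sum())

# Demo: a VGGT cloud reconstructed at half the true metric scale.
rng = np.random.default_rng(1)
lidar = rng.uniform(-5.0, 5.0, size=(500, 3))
vggt = lidar / 2.0
good = bbox_scale_penalty(lidar, vggt, 2.0)  # correct scale: zero penalty
bad = bbox_scale_penalty(lidar, vggt, 1.0)   # wrong scale: positive penalty
```

The log-ratio form makes over- and under-scaling penalties symmetric; a per-axis term also captures the anisotropic distortions that differing LiDAR/camera FOVs can introduce.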
Authors
Lijie Wang
State Key Laboratory of Industrial Control Technology, Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, 310027, China
Lianjie Guo
State Key Laboratory of Industrial Control Technology, Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, 310027, China
Ziyi Xu
École Polytechnique Fédérale de Lausanne (EPFL)
Qianhao Wang
PhD, Zhejiang University
Robotics
Fei Gao
State Key Laboratory of Industrial Control Technology, Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, 310027, China
Xieyuanli Chen
Associate Professor, NUDT, China
Robotics, SLAM, Localization, LiDAR Perception, Robot Learning