LiDAR-VGGT: Cross-Modal Coarse-to-Fine Fusion for Globally Consistent and Metric-Scale Dense Mapping

📅 2025-11-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LiDAR-inertial-visual odometry (LIVO) is sensitive to extrinsic calibration, while 3D vision foundation models (e.g., VGGT) scale poorly to large scenes and lack metric scale. To address these limitations, this paper proposes LiDAR-VGGT, the first framework to tightly couple LIVO and VGGT for dense mapping. Methodologically, it introduces a two-stage coarse-to-fine fusion pipeline: a pre-fusion module initializes metric scale, while a post-fusion module performs joint cross-modal pose optimization with bounding-box constraints and explicit scale regularization to suppress inter-modal scale distortion. Evaluated on multiple large-scale datasets, LiDAR-VGGT generates globally consistent, high-accuracy, metric-scale dense colored point clouds, outperforming state-of-the-art VGGT and LIVO baselines. The framework and an accompanying open-source point cloud evaluation toolkit are released.

📝 Abstract
Reconstructing large-scale colored point clouds is an important task in robotics, supporting perception, navigation, and scene understanding. Despite advances in LiDAR-inertial-visual odometry (LIVO), its performance remains highly sensitive to extrinsic calibration. Meanwhile, 3D vision foundation models, such as VGGT, suffer from limited scalability in large environments and inherently lack metric scale. To overcome these limitations, we propose LiDAR-VGGT, a novel framework that tightly couples LiDAR-inertial odometry with the state-of-the-art VGGT model through a two-stage coarse-to-fine fusion pipeline: First, a pre-fusion module with robust initialization refinement efficiently estimates VGGT poses and point clouds with coarse metric scale within each session. Then, a post-fusion module refines the cross-modal 3D similarity transformation, using bounding-box-based regularization to reduce scale distortions caused by inconsistent FOVs between LiDAR and camera sensors. Extensive experiments across multiple datasets demonstrate that LiDAR-VGGT achieves dense, globally consistent colored point clouds and outperforms both VGGT-based methods and LIVO baselines. Our framework and the proposed colored point cloud evaluation toolkit will be released as open source.
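The abstract's "cross-modal 3D similarity transformation" maps the scale-free VGGT reconstruction onto the metric LiDAR frame via a scale, rotation, and translation. The paper's exact formulation is not given here; a minimal sketch of one standard closed-form solution for such a Sim(3) alignment (Umeyama's method, assuming known point correspondences) looks like this:

```python
import numpy as np

def umeyama_sim3(src, dst):
    """Closed-form similarity transform (Umeyama): find scale s, rotation R,
    translation t such that dst ~= s * R @ src_i + t for corresponding
    (N, 3) point sets. Illustrative only, not the paper's exact pipeline."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    x, y = src - mu_src, dst - mu_dst
    cov = y.T @ x / len(src)                      # 3x3 cross-covariance
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # guard against reflections
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_src = (x ** 2).sum() / len(src)           # variance of source cloud
    s = np.trace(np.diag(D) @ S) / var_src        # recovered metric scale
    t = mu_dst - s * R @ mu_src
    return s, R, t

# Demo on synthetic data: apply a known transform, then recover it.
rng = np.random.default_rng(0)
src = rng.normal(size=(200, 3))                   # e.g. scale-free VGGT points
theta = 0.5
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
dst = 2.0 * src @ R_true.T + np.array([1.0, -2.0, 3.0])  # metric LiDAR frame
s, R, t = umeyama_sim3(src, dst)
```

In practice the correspondences would come from projecting LiDAR points into the camera frames tracked by VGGT; the recovered `s` is what gives the fused map metric scale.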
Problem

Research questions and friction points this paper is trying to address.

Overcoming sensitivity to extrinsic calibration in LiDAR visual odometry systems
Addressing limited scalability and metric scale absence in 3D vision models
Mitigating scale distortions from inconsistent sensor field-of-views
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal coarse-to-fine fusion pipeline
Robust initialization refinement for VGGT poses
Bounding-box-based regularization reduces scale distortions
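The bounding-box-based regularization above can be pictured as penalizing disagreement between the spatial extents of the two clouds after scaling. The paper's actual loss is not reproduced in this summary; the following is a hypothetical sketch comparing per-axis axis-aligned bounding-box extents of a LiDAR cloud against a VGGT cloud scaled by the estimated factor `s`:

```python
import numpy as np

def bbox_scale_penalty(lidar_pts, vggt_pts, s):
    """Illustrative regularizer (not the paper's exact loss): squared
    log-ratio of per-axis bounding-box extents between the LiDAR cloud
    and the VGGT cloud after applying the estimated metric scale s."""
    ext_lidar = lidar_pts.max(axis=0) - lidar_pts.min(axis=0)
    ext_vggt = s * (vggt_pts.max(axis=0) - vggt_pts.min(axis=0))
    ratio = np.log(ext_vggt / ext_lidar)   # 0 per axis when extents agree
    return float((ratio ** 2).sum())

# Demo: a VGGT cloud reconstructed at half the true metric scale.
rng = np.random.default_rng(1)
lidar = rng.uniform(-5.0, 5.0, size=(500, 3))
vggt = lidar / 2.0
good = bbox_scale_penalty(lidar, vggt, 2.0)  # correct scale: zero penalty
bad = bbox_scale_penalty(lidar, vggt, 1.0)   # wrong scale: positive penalty
```

The log-ratio form makes over- and under-scaling penalties symmetric; a per-axis term also captures the anisotropic distortions that differing LiDAR/camera FOVs can introduce.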
Authors
Lijie Wang
State Key Laboratory of Industrial Control Technology, Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, 310027, China
Lianjie Guo
State Key Laboratory of Industrial Control Technology, Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, 310027, China
Ziyi Xu
École Polytechnique Fédérale de Lausanne (EPFL)
Qianhao Wang
PhD, Zhejiang University
Robotics
Fei Gao
State Key Laboratory of Industrial Control Technology, Institute of Cyber-Systems and Control, Zhejiang University, Hangzhou, 310027, China
Xieyuanli Chen
Associate Professor, NUDT, China
Robotics, SLAM, Localization, LiDAR Perception, Robot Learning