M^3: Dense Matching Meets Multi-View Foundation Models for Monocular Gaussian Splatting SLAM

📅 2026-03-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Monocular video-based pose estimation and geometric reconstruction in dynamic scenes often struggle to simultaneously achieve high accuracy and computational efficiency. To address this challenge, this work proposes a novel framework that integrates a multi-view foundation model with Gaussian Splatting SLAM. The approach introduces, for the first time, a dense matching head to generate pixel-level fine-grained correspondences, complemented by a dynamic region suppression mechanism and a cross-inference intrinsic parameter alignment strategy. These innovations significantly enhance tracking stability and geometric reconstruction quality. Evaluated on benchmarks such as ScanNet++, the method achieves state-of-the-art performance, reducing the ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and surpassing ARTDECO by 2.11 dB in PSNR.
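The headline 64.3% reduction is in ATE RMSE, the standard absolute-trajectory-error metric for SLAM evaluation: estimated camera positions are rigidly aligned to ground truth in closed form, then the root-mean-square of the per-frame position errors is taken. A minimal sketch of that metric (not the paper's evaluation code; function and variable names are illustrative):

```python
import numpy as np

def ate_rmse(gt, est):
    """ATE RMSE between ground-truth and estimated camera positions
    (both N x 3 arrays), after a closed-form rigid (Kabsch/Umeyama)
    alignment of the estimate to the ground truth."""
    gt = np.asarray(gt, dtype=float)
    est = np.asarray(est, dtype=float)
    mu_g, mu_e = gt.mean(axis=0), est.mean(axis=0)
    # Cross-covariance of the centered trajectories.
    H = (est - mu_e).T @ (gt - mu_g)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection in the least-squares rotation.
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T            # optimal rotation
    t = mu_g - R @ mu_e           # optimal translation
    aligned = est @ R.T + t
    return float(np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1))))
```

Because of the alignment step, a trajectory that differs from ground truth only by a global rotation and translation scores zero; benchmark protocols sometimes use a Sim(3) (scale-included) alignment instead for monocular methods.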

📝 Abstract
Streaming reconstruction from uncalibrated monocular video remains challenging, as it requires both high-precision pose estimation and computationally efficient online refinement in dynamic environments. While coupling 3D foundation models with SLAM frameworks is a promising paradigm, a critical bottleneck persists: most multi-view foundation models estimate poses in a feed-forward manner, yielding pixel-level correspondences that lack the requisite precision for rigorous geometric optimization. To address this, we present M^3, which augments the Multi-view foundation model with a dedicated Matching head to facilitate fine-grained dense correspondences and integrates it into a robust Monocular Gaussian Splatting SLAM. M^3 further enhances tracking stability by incorporating dynamic area suppression and cross-inference intrinsic alignment. Extensive experiments on diverse indoor and outdoor benchmarks demonstrate state-of-the-art accuracy in both pose estimation and scene reconstruction. Notably, M^3 reduces ATE RMSE by 64.3% compared to VGGT-SLAM 2.0 and outperforms ARTDECO by 2.11 dB in PSNR on the ScanNet++ dataset.
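The 2.11 dB gain over ARTDECO is measured in PSNR, the standard rendering-quality metric: peak signal power over mean squared error, in decibels. A minimal sketch of that metric (not the authors' evaluation code; names are illustrative):

```python
import numpy as np

def psnr(rendered, reference, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-shape
    images with pixel values in [0, max_val]."""
    rendered = np.asarray(rendered, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mse = np.mean((rendered - reference) ** 2)
    if mse == 0.0:
        return float("inf")          # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))
```

Since PSNR is logarithmic, a 2.11 dB improvement corresponds to roughly a 1.6x reduction in mean squared reconstruction error against the held-out views.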
Problem

Research questions and friction points this paper is trying to address.

monocular SLAM
streaming reconstruction
pose estimation
dense matching
multi-view foundation models
Innovation

Methods, ideas, or system contributions that make the work stand out.

dense matching
multi-view foundation models
monocular SLAM
Gaussian splatting
pose estimation
👥 Authors
Kerui Ren — Shanghai Jiao Tong University; Shanghai AI Laboratory (3D Reconstruction, Neural Rendering)
Guanghao Li — Fudan University (Graphics)
Changjian Jiang — Shanghai Artificial Intelligence Laboratory; Zhejiang University
Yingxiang Xu — Beijing Institute of Technology; Shanghai Artificial Intelligence Laboratory
Tao Lu — Shanghai AI Lab (3D)
Linning Xu — The Chinese University of Hong Kong; Shanghai Artificial Intelligence Laboratory
Junting Dong — Zhejiang University (Computer Vision)
Jiangmiao Pang — Shanghai Artificial Intelligence Laboratory
Mulin Yu — Shanghai AILab; INRIA (3D reconstruction and 3D repairing)
Bo Dai — The University of Hong Kong (Generative AI, Interactive AI, Real2Sim2Real)