VIM-GS: Visual-Inertial Monocular Gaussian Splatting via Object-level Guidance in Large Scenes

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating high-precision dense depth maps from monocular images for high-fidelity Gaussian Splatting (GS) rendering in large-scale scenes. To this end, we propose an end-to-end depth optimization framework that jointly fuses sparse but accurate depth estimates from visual-inertial Structure-from-Motion (SfM) with dense yet coarse depth predictions from large foundation models. Our method introduces two key innovations: (i) an object-level segmentation-guided depth propagation algorithm to preserve semantic and geometric coherence across static and dynamic objects, and (ii) a dynamic depth refinement module that adaptively corrects depth errors in motion-prone and structurally complex regions. Evaluated on both public and newly constructed large-scale datasets, our approach significantly improves geometric consistency and texture fidelity in novel-view synthesis. It outperforms state-of-the-art monocular GS methods, especially in scenes with intricate geometry and dynamic content.

📝 Abstract
VIM-GS is a Gaussian Splatting (GS) framework using monocular images for novel-view synthesis (NVS) in large scenes. GS typically requires accurate depth from RGB-D/stereo cameras to initialize Gaussian ellipsoids, but their limited depth-sensing range makes it difficult for GS to work in large scenes. Monocular images, however, lack depth to guide the learning and lead to inferior NVS results. Although large foundation models (LFMs) for monocular depth estimation are available, they suffer from cross-frame inconsistency, inaccuracy in distant scenes, and ambiguity under deceptive texture cues. This paper aims to generate dense, accurate depth images from monocular RGB inputs for high-fidelity GS rendering. The key idea is to leverage the accurate but sparse depth from visual-inertial Structure-from-Motion (SfM) to refine the dense but coarse depth from LFMs. To bridge the sparse input and dense output, we propose an object-segmented depth propagation algorithm that renders the depth of pixels belonging to structured objects. We then develop a dynamic depth refinement module to handle the degraded SfM depth of dynamic objects and refine the coarse LFM depth. Experiments on public and customized datasets demonstrate the superior rendering quality of VIM-GS in large scenes.
Problem

Research questions and friction points this paper is trying to address.

Generating accurate dense depth from monocular images
Overcoming cross-frame inconsistency in monocular depth estimation
Enabling Gaussian Splatting in large scenes with sparse inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines sparse SfM depth with dense LFM depth
Uses object-segmented depth propagation algorithm
Develops dynamic depth refinement for moving objects
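The core fusion idea above, using sparse but accurate SfM depth to correct dense but coarse LFM depth, can be illustrated with a minimal sketch. The snippet below fits a global scale-and-shift alignment of the LFM depth map to the sparse SfM observations via least squares; this is a common baseline for sparse-to-dense depth fusion, not the paper's actual object-segmented propagation algorithm, and the function and variable names are illustrative assumptions.

```python
import numpy as np

def align_lfm_depth(lfm_depth, sfm_depth, sfm_mask):
    """Fit a global scale/shift so the dense LFM depth agrees with the
    sparse-but-accurate SfM depth at the pixels SfM actually observed.

    lfm_depth: (H, W) dense depth from a large foundation model
    sfm_depth: (H, W) depth from visual-inertial SfM (valid where sfm_mask)
    sfm_mask:  (H, W) boolean mask of sparse SfM observations
    """
    d = lfm_depth[sfm_mask]   # LFM predictions at the sparse points
    z = sfm_depth[sfm_mask]   # accurate SfM depths at the same points
    # Solve min_{a,b} ||a * d + b - z||^2 with ordinary least squares.
    A = np.stack([d, np.ones_like(d)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, z, rcond=None)
    return scale * lfm_depth + shift

# Toy example: the LFM depth is the true depth under an affine distortion.
true_depth = np.array([[2.0, 4.0], [6.0, 8.0]])
lfm = (true_depth - 1.0) / 2.0                   # scale/shift-corrupted
mask = np.array([[True, False], [True, True]])   # sparse SfM coverage
refined = align_lfm_depth(lfm, true_depth, mask)
```

A single global affine fit cannot fix per-object or per-region errors, which is precisely why the paper introduces segmentation-guided propagation and dynamic-region refinement on top of the sparse SfM anchor points.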
Shengkai Zhang
State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology, Wuhan, China
Yuhe Liu
State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology, Wuhan, China
Guanjun Wu
Huazhong University of Science and Technology, Wuhan, China
Jianhua He
University of Essex
5G/6G, connected autonomous vehicles, IoT, mobile edge computing, deep learning
Xinggang Wang
Professor, Huazhong University of Science and Technology
Artificial Intelligence, Computer Vision, Autonomous Driving, Object Detection, Object Segmentation
Mozi Chen
State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology, Wuhan, China
Kezhong Liu
State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology, Wuhan, China