VIM-GS: Visual-Inertial Monocular Gaussian Splatting via Object-level Guidance in Large Scenes

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating high-precision dense depth maps from monocular images for high-fidelity Gaussian Splatting (GS) rendering in large-scale scenes. To this end, we propose an end-to-end depth optimization framework that jointly fuses sparse but accurate depth estimates from visual-inertial Structure-from-Motion (SfM) with dense yet coarse depth predictions from large foundation models. Our method introduces two key innovations: (i) an object-level segmentation-guided depth propagation algorithm to preserve semantic and geometric coherence across static and dynamic objects, and (ii) a dynamic depth refinement module that adaptively corrects depth errors in motion-prone and structurally complex regions. Evaluated on both public and newly constructed large-scale datasets, our approach significantly improves geometric consistency and texture fidelity in novel-view synthesis. It outperforms state-of-the-art monocular GS methods, especially in scenes with intricate geometry and dynamic content.

📝 Abstract
VIM-GS is a Gaussian Splatting (GS) framework using monocular images for novel-view synthesis (NVS) in large scenes. GS typically requires accurate depth from RGB-D/stereo cameras to initialize Gaussian ellipsoids, but their limited depth-sensing range makes it difficult for GS to work in large scenes. Monocular images, however, lack depth to guide the learning and lead to inferior NVS results. Although large foundation models (LFMs) for monocular depth estimation are available, they suffer from cross-frame inconsistency, inaccuracy in distant scenes, and ambiguity under deceptive texture cues. This paper aims to generate dense, accurate depth images from monocular RGB inputs for high-fidelity GS rendering. The key idea is to leverage the accurate but sparse depth from visual-inertial Structure-from-Motion (SfM) to refine the dense but coarse depth from LFMs. To bridge the sparse input and dense output, we propose an object-segmented depth propagation algorithm that renders the depth of pixels belonging to structured objects. We then develop a dynamic depth refinement module to handle the degraded SfM depth of dynamic objects and refine the coarse LFM depth. Experiments on public and customized datasets demonstrate the superior rendering quality of VIM-GS in large scenes.
Problem

Research questions and friction points this paper is trying to address.

Generating accurate dense depth from monocular images
Overcoming cross-frame inconsistency in monocular depth estimation
Enabling Gaussian Splatting in large scenes with sparse inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines sparse SfM depth with dense LFM depth
Uses object-segmented depth propagation algorithm
Develops dynamic depth refinement for moving objects
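The core fusion idea above, using sparse but accurate SfM depth to correct dense but coarse LFM depth, can be illustrated with a minimal sketch. The snippet below fits a global scale-and-shift alignment of the LFM depth map to the sparse SfM observations via least squares; this is a common baseline for sparse-to-dense depth fusion, not the paper's actual object-segmented propagation algorithm, and the function and variable names are illustrative assumptions.

```python
import numpy as np

def align_lfm_depth(lfm_depth, sfm_depth, sfm_mask):
    """Fit a global scale/shift so the dense LFM depth agrees with the
    sparse-but-accurate SfM depth at the pixels SfM actually observed.

    lfm_depth: (H, W) dense depth from a large foundation model
    sfm_depth: (H, W) depth from visual-inertial SfM (valid where sfm_mask)
    sfm_mask:  (H, W) boolean mask of sparse SfM observations
    """
    d = lfm_depth[sfm_mask]   # LFM predictions at the sparse points
    z = sfm_depth[sfm_mask]   # accurate SfM depths at the same points
    # Solve min_{a,b} ||a * d + b - z||^2 with ordinary least squares.
    A = np.stack([d, np.ones_like(d)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, z, rcond=None)
    return scale * lfm_depth + shift

# Toy example: the LFM depth is the true depth under an affine distortion.
true_depth = np.array([[2.0, 4.0], [6.0, 8.0]])
lfm = (true_depth - 1.0) / 2.0                   # scale/shift-corrupted
mask = np.array([[True, False], [True, True]])   # sparse SfM coverage
refined = align_lfm_depth(lfm, true_depth, mask)
```

A single global affine fit cannot fix per-object or per-region errors, which is precisely why the paper introduces segmentation-guided propagation and dynamic-region refinement on top of the sparse SfM anchor points.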
Shengkai Zhang
State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology, Wuhan, China
Yuhe Liu
State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology, Wuhan, China
Guanjun Wu
Huazhong University of Science and Technology, Wuhan, China
Jianhua He
University of Essex
5G/6G, connected autonomous vehicles, IoT, mobile edge computing, deep learning
Xinggang Wang
Professor, Huazhong University of Science and Technology
Artificial Intelligence, Computer Vision, Autonomous Driving, Object Detection, Object Segmentation
Mozi Chen
State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology, Wuhan, China
Kezhong Liu
State Key Laboratory of Maritime Technology and Safety, Wuhan University of Technology, Wuhan, China