🤖 AI Summary
To address the challenges of reconstruction distortion under large viewpoint shifts, temporal incoherence in generative methods, and poor scene controllability in urban scene reconstruction for autonomous driving, this paper proposes a synergistic framework that integrates multi-modal diffusion models with 3D Gaussian Splatting (3DGS). The paper introduces the first cross-modal video diffusion model that jointly leverages LiDAR point clouds, RGB images, and geometric priors to enable feed-forward novel-view synthesis without per-scene optimization. The diffusion-based priors are in turn used to refine the 3DGS representation, substantially improving rendering robustness under extreme viewpoints. The method incorporates end-to-end differentiable rendering supervision, joint depth/semantic/RGB generation, and LiDAR-RGB conditional modeling. Evaluated on the Waymo Open Dataset, the approach achieves state-of-the-art performance in reconstruction accuracy, novel-view synthesis quality, and temporal consistency.
📝 Abstract
Recent breakthroughs in radiance fields have significantly advanced 3D scene reconstruction and novel view synthesis (NVS) in autonomous driving. Nevertheless, critical limitations persist: reconstruction-based methods deteriorate substantially under significant viewpoint deviations from the training trajectories, while generation-based techniques struggle with temporal coherence and precise scene controllability. To overcome these challenges, we present MuDG, an innovative framework that integrates a Multi-modal Diffusion model with Gaussian Splatting (GS) for Urban Scene Reconstruction. MuDG conditions a multi-modal video diffusion model on aggregated LiDAR point clouds together with RGB and geometric priors, synthesizing photorealistic RGB, depth, and semantic outputs for novel viewpoints. This synthesis pipeline enables feed-forward NVS without computationally intensive per-scene optimization, and provides comprehensive supervision signals that refine 3DGS representations to enhance rendering robustness under extreme viewpoint changes. Experiments on the Waymo Open Dataset demonstrate that MuDG outperforms existing methods in both reconstruction and synthesis quality.
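The abstract describes using the diffusion model's synthesized RGB, depth, and semantic outputs as supervision signals for refining the 3DGS representation. As a minimal illustrative sketch (not the paper's actual objective), the combined supervision could take the form of a weighted multi-modal loss between 3DGS renderings and the diffusion-synthesized targets; the loss terms and weights below are assumptions for illustration only.

```python
import numpy as np

def multimodal_supervision_loss(render, target, w_rgb=1.0, w_depth=0.1, w_sem=0.5):
    """Toy multi-modal supervision loss (illustrative, not from the paper).

    `render` and `target` are dicts with keys:
      "rgb"   -- (H, W, 3) array in [0, 1]
      "depth" -- (H, W) array
      "sem"   -- (H, W, C) per-pixel class probabilities
    Weights w_rgb / w_depth / w_sem are hypothetical hyperparameters.
    """
    # L1 photometric and depth terms between the 3DGS render and
    # the diffusion-synthesized novel-view targets.
    l_rgb = np.abs(render["rgb"] - target["rgb"]).mean()
    l_depth = np.abs(render["depth"] - target["depth"]).mean()
    # Cross-entropy between rendered and target semantic distributions.
    eps = 1e-8
    l_sem = -(target["sem"] * np.log(render["sem"] + eps)).sum(axis=-1).mean()
    return w_rgb * l_rgb + w_depth * l_depth + w_sem * l_sem
```

In an actual pipeline this scalar would be backpropagated through a differentiable 3DGS rasterizer to update the Gaussian parameters; here it only demonstrates how the three synthesized modalities can jointly supervise one rendering.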