🤖 AI Summary
To address the limitations of existing methods in high-fidelity single-view street-scene reconstruction and cross-scene generalization for autonomous driving, this paper proposes a sparse LiDAR-guided multimodal Gaussian neural rendering framework. The method is the first to embed sparse LiDAR depth measurements directly into a differentiable Gaussian rasterization pipeline, establishing a joint image–LiDAR optimization objective. It further introduces a multimodal feature matching mechanism and a multi-scale Gaussian decoder to enable joint modeling of geometry and appearance. The approach significantly improves the accuracy of Gaussian ellipsoid prediction and achieves state-of-the-art rendering quality (PSNR/SSIM) on the Waymo and KITTI benchmarks. Moreover, it demonstrates strong zero-shot generalization to unseen scenes in novel-view synthesis, establishing a lightweight, robust paradigm for single-view street-scene reconstruction.
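The summary does not spell out the joint image–LiDAR objective; the following is a minimal PyTorch sketch of what such an objective typically looks like, assuming an L1 photometric term plus a depth term evaluated only where sparse LiDAR returns exist. The function name, the `lambda_depth` weight, and the choice of L1 losses are illustrative assumptions, not the authors' specification.

```python
import torch
import torch.nn.functional as F

def joint_image_lidar_loss(rendered_rgb, gt_rgb, rendered_depth, lidar_depth,
                           lambda_depth=0.1):
    """Hypothetical joint objective: photometric loss on the rendered image
    plus a depth loss supervised only at pixels with sparse LiDAR returns."""
    # Photometric term over all pixels.
    loss_rgb = F.l1_loss(rendered_rgb, gt_rgb)
    # Sparse LiDAR provides depth at a small subset of pixels; mask out the rest.
    valid = lidar_depth > 0
    loss_depth = F.l1_loss(rendered_depth[valid], lidar_depth[valid])
    return loss_rgb + lambda_depth * loss_depth
```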
📝 Abstract
We present a novel approach, termed ADGaussian, for generalizable street scene reconstruction. The proposed method enables high-quality rendering from single-view input. Unlike prior Gaussian Splatting methods that primarily focus on geometry refinement, we emphasize the importance of joint optimization of image and depth features for accurate Gaussian prediction. To this end, we first incorporate sparse LiDAR depth as an additional input modality, formulating the Gaussian prediction process as a joint learning framework of visual information and geometric cues. Furthermore, we propose a multi-modal feature matching strategy coupled with a multi-scale Gaussian decoding model to enhance the joint refinement of multi-modal features, thereby enabling efficient multi-modal Gaussian learning. Extensive experiments on two large-scale autonomous driving datasets, Waymo and KITTI, demonstrate that our ADGaussian achieves state-of-the-art performance and exhibits superior zero-shot generalization capabilities in novel-view shifting.
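The abstract describes predicting Gaussians from fused image and depth features but does not detail the decoder; below is a minimal sketch of a per-pixel Gaussian prediction head over such fused features. The 14-channel parameterization (center offset, opacity, scale, rotation quaternion, color) and the activation choices follow common conventions in feed-forward Gaussian Splatting and are assumptions, not the authors' design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Hypothetical per-pixel head: maps fused image+depth features
    to the attributes of one 3D Gaussian per pixel."""
    def __init__(self, feat_dim=64):
        super().__init__()
        # 3 offset + 1 opacity + 3 scale + 4 rotation + 3 color = 14 channels
        self.proj = nn.Conv2d(feat_dim, 14, kernel_size=1)

    def forward(self, fused_feat):
        out = self.proj(fused_feat)
        offset  = out[:, 0:3]                       # refine LiDAR-seeded centers
        opacity = torch.sigmoid(out[:, 3:4])        # constrain to (0, 1)
        scale   = torch.exp(out[:, 4:7])            # strictly positive scales
        rot     = F.normalize(out[:, 7:11], dim=1)  # unit quaternion
        color   = torch.sigmoid(out[:, 11:14])      # RGB in (0, 1)
        return offset, opacity, scale, rot, color
```

In this sketch, `fused_feat` would come from a cross-modal matching module between the image encoder and a sparse-depth encoder, applied at multiple scales; that module is not shown here.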