🤖 AI Summary
Existing 3D reconstruction methods suffer from low efficiency and limited quality due to indirect geometry learning and the conventional paradigm of decoupling geometry from appearance. To address this, we propose the first end-to-end jointly optimized framework integrating explicit triangular meshes with 3D Gaussian points: differentiable 3D Gaussians are rigidly bound to mesh faces and jointly optimized under photometric supervision to reconstruct both geometry and surface appearance. By breaking the conventional decoupled geometry-appearance paradigm, our approach enables efficient, high-fidelity reconstruction and real-time rendering. Quantitatively, it achieves a +1.8 dB PSNR improvement over prior methods on the DTU and BlendedMVS benchmarks. Moreover, the framework supports interactive mesh editing and incremental updates for dynamic scenes, significantly enhancing reconstruction efficiency and editing flexibility.
📝 Abstract
Accurately reconstructing a 3D scene with explicit geometry is both attractive and challenging. Geometry reconstruction can benefit from incorporating differentiable appearance models, such as Neural Radiance Fields and 3D Gaussian Splatting (3DGS). However, existing methods encounter efficiency issues due to indirect geometry learning and the paradigm of separately modeling geometry and surface appearance. In this work, we propose a learnable scene model that couples 3DGS with an explicit geometry representation, namely a mesh. Our model learns the mesh and appearance in an end-to-end manner: we bind 3D Gaussians to the mesh faces and perform differentiable rendering of 3DGS to obtain photometric supervision. This creates an effective information pathway that supervises the learning of both the 3DGS and the mesh. Experimental results demonstrate that the learned scene model not only improves efficiency and rendering quality but also enables manipulation via the explicit mesh. In addition, our model has a unique advantage in adapting to scene updates, thanks to the end-to-end learning of both mesh and appearance.
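The core binding step described above can be sketched in a few lines. The snippet below is an illustrative assumption, not the paper's exact scheme: it places one Gaussian at each triangle's barycenter and aligns one Gaussian axis with the face normal, so that moving a mesh vertex rigidly carries its bound Gaussians along (the function name `bind_gaussians_to_faces` and the one-Gaussian-per-face choice are hypothetical simplifications).

```python
import numpy as np

def bind_gaussians_to_faces(vertices, faces):
    """Sketch of rigidly binding Gaussians to mesh faces.

    For each triangle, place one Gaussian center at the barycenter and
    orient it by the face normal. In an end-to-end pipeline these
    quantities would be differentiable functions of the mesh vertices,
    so photometric gradients flow back to the geometry.
    """
    tri = vertices[faces]                 # (F, 3, 3): per-face vertex triples
    centers = tri.mean(axis=1)            # barycenter of each triangle
    # Face normal from the two edge vectors, normalized to unit length.
    e1 = tri[:, 1] - tri[:, 0]
    e2 = tri[:, 2] - tri[:, 0]
    normals = np.cross(e1, e2)
    normals /= np.linalg.norm(normals, axis=1, keepdims=True)
    return centers, normals

# Toy mesh: a single right triangle in the z = 0 plane.
V = np.array([[0.0, 0.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
F = np.array([[0, 1, 2]])
centers, normals = bind_gaussians_to_faces(V, F)
# centers[0] is the barycenter (1/3, 1/3, 0); normals[0] points along +z.
```

Because the Gaussian parameters are expressed directly in terms of the mesh, a single photometric loss on the rendered Gaussians supervises both appearance and the underlying geometry.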