🤖 AI Summary
Problem: Neural implicit methods struggle to reconstruct fine-grained geometry, sharp edges, and thin structures from sparse multi-view RGB inputs—particularly with only two views (front and back).
Method: Moving beyond conventional zero-order geometric constraints (e.g., point-projection consistency), we introduce first-order differential constraints—specifically surface normals—as explicit supervision for neural implicit modeling. We estimate monocular depth using Depth Anything and derive approximate image-space surface normals, formulating a normal consistency loss that supervises the first-order differential properties of signed distance functions (SDFs) or NeRF-like implicit fields.
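The image-space normal estimation step can be sketched as follows. This is a minimal illustration, not the paper's implementation: it treats the monocular depth map as a height field and takes finite-difference gradients; the focal-length scaling parameters `fx`, `fy` are hypothetical placeholders for whatever camera model the method actually uses.

```python
import numpy as np

def normals_from_depth(depth, fx=1.0, fy=1.0):
    """Approximate image-space surface normals from a depth map (sketch)."""
    # Finite-difference gradients of depth along the image rows and columns.
    dz_dv, dz_du = np.gradient(depth)
    # Normal of the surface z = depth(u, v) is proportional to
    # (-dz/du, -dz/dv, 1); fx, fy are assumed scale factors.
    n = np.stack([-dz_du * fx, -dz_dv * fy, np.ones_like(depth)], axis=-1)
    # Normalize to unit length.
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# Toy example: a planar depth ramp yields a constant tilted normal.
depth = np.tile(np.linspace(1.0, 2.0, 8), (8, 1))
normals = normals_from_depth(depth)
```

On this ramp, every pixel gets the same normal tilted against the direction of increasing depth, which is the behavior a normal consistency loss would then enforce on the implicit surface.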
Results: Evaluated on both synthetic and real-world datasets, our method achieves high-fidelity 3D surface reconstruction from merely two RGB images. It significantly outperforms state-of-the-art approaches in PSNR, Chamfer distance, and visual quality, demonstrating that normal supervision is critical for recovering fine-scale geometric details.
📝 Abstract
Neural implicit representations have emerged as a powerful paradigm for 3D reconstruction. Despite their success, however, existing methods fail to capture fine geometric details and thin structures, especially in scenarios where only sparse RGB views of the objects of interest are available. We hypothesize that current methods for learning neural implicit representations from RGB or RGBD images produce 3D surfaces with missing parts and details because they rely only on zero-order differential properties, i.e., the 3D surface points and their projections, as supervisory signals. Such properties, however, do not capture the local 3D geometry around the points and also ignore the interactions between points. This paper demonstrates that training neural representations with first-order differential properties, i.e., surface normals, leads to highly accurate 3D surface reconstruction even in situations where as few as two RGB images (front and back) are available. Given multiview RGB images of an object of interest, we first compute approximate surface normals in the image space using the gradient of the depth maps produced by an off-the-shelf monocular depth estimator such as the Depth Anything model. An implicit surface regressor is then trained using a loss function that enforces the first-order differential properties of the regressed surface to match those estimated from Depth Anything. Our extensive experiments on a wide range of real and synthetic datasets show that the proposed method achieves an unprecedented level of reconstruction accuracy even when using as few as two RGB views. A detailed ablation study further demonstrates that normal-based supervision plays a key role in this significant improvement in performance, enabling the 3D reconstruction of intricate geometric details and thin structures that were previously challenging to capture.
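The first-order supervision described above can be sketched as a normal consistency loss between the gradient of the implicit field and the estimated target normals. This is a hedged illustration, not the paper's loss: a toy analytic sphere SDF stands in for the learned network, and gradients are taken by finite differences rather than autodiff; the `1 - cosine similarity` form is one common choice of normal loss, assumed here.

```python
import numpy as np

def sdf_sphere(p, r=1.0):
    """Analytic unit-sphere SDF, a stand-in for a learned implicit network."""
    return np.linalg.norm(p, axis=-1) - r

def sdf_normals(sdf, p, eps=1e-4):
    """Surface normals as the normalized finite-difference gradient of the SDF."""
    grads = []
    for i in range(3):
        d = np.zeros(3)
        d[i] = eps
        grads.append((sdf(p + d) - sdf(p - d)) / (2.0 * eps))
    g = np.stack(grads, axis=-1)
    return g / np.linalg.norm(g, axis=-1, keepdims=True)

def normal_consistency_loss(pred_n, target_n):
    """First-order loss: mean (1 - cosine similarity) between normal fields."""
    return np.mean(1.0 - np.sum(pred_n * target_n, axis=-1))

# Points on the unit sphere; their true normals point radially outward,
# so the loss against the sphere SDF's gradient normals is ~0.
pts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
target = pts.copy()
loss = normal_consistency_loss(sdf_normals(sdf_sphere, pts), target)
```

In training, `target` would come from the Depth Anything-derived image-space normals, and the SDF gradient would be computed by automatic differentiation through the network.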