🤖 AI Summary
This work proposes the first end-to-end unified framework for simultaneously reconstructing complete 3D geometry, spatially varying physically based materials, and environmental illumination from sparse multi-view images, overcoming the complexity and high computational cost of traditional pipelines that handle these components separately. The method introduces a Transformer-based cross-view conditional fusion architecture, a dual-path prediction strategy, and a differentiable Monte Carlo multiple importance sampling renderer. By leveraging a hybrid training scheme combining synthetic and real-world data, the approach significantly enhances the disentanglement of materials and lighting. Experiments demonstrate that the system generates high-quality, relightable 3D assets in under one second, outperforming existing methods in geometric accuracy, material detail, and lighting generalization.
📝 Abstract
Reconstructing 3D assets from images has long required separate pipelines for geometry reconstruction, material estimation, and illumination recovery, each with distinct limitations and computational overhead. We present ReLi3D, the first unified end-to-end pipeline that simultaneously reconstructs complete 3D geometry, spatially-varying physically-based materials, and environment illumination from sparse multi-view images in under one second. Our key insight is that multi-view constraints can dramatically improve material and illumination disentanglement, a problem that remains fundamentally ill-posed for single-image methods. Our approach fuses the multi-view input with a transformer cross-conditioning architecture and then applies a novel unified two-path prediction strategy. The first path predicts the object's structure and appearance, while the second path predicts the environment illumination from the image background or from object reflections. Combined with a differentiable Monte Carlo multiple importance sampling renderer, this yields a training pipeline tailored to illumination disentanglement. In addition, our mixed-domain training protocol, which combines synthetic PBR datasets with real-world RGB captures, delivers results that generalize in geometry, material accuracy, and illumination quality. By unifying previously separate reconstruction tasks into a single feed-forward pass, we enable near-instantaneous generation of complete, relightable 3D assets. Project Page: https://reli3d.jdihlmann.com/
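For readers unfamiliar with multiple importance sampling (MIS), the following is a minimal, non-differentiable toy sketch of the general technique using the standard balance heuristic. It is not ReLi3D's renderer: the function `mis_estimate`, the 1D integrand, and the two sampling strategies are all hypothetical choices made for illustration. In a renderer the two strategies would typically be light sampling and BRDF sampling.

```python
import math
import random


def mis_estimate(f, strategies, rng):
    """Estimate the integral of f by combining several sampling strategies.

    strategies: list of (sample_fn, pdf_fn, n_samples) tuples (hypothetical layout).
    Uses the balance heuristic w_i(x) = n_i p_i(x) / sum_j n_j p_j(x).
    """
    total = 0.0
    for sample_fn, _, n in strategies:
        for _ in range(n):
            x = sample_fn(rng)
            # With the balance heuristic, w_i(x) * f(x) / (n_i * p_i(x))
            # simplifies to f(x) / sum_j n_j * p_j(x).
            denom = sum(n_j * pdf_j(x) for _, pdf_j, n_j in strategies)
            if denom > 0.0:
                total += f(x) / denom
    return total


# Toy setup (assumption): integrate f(x) = x^2 over [0, 1], exact value 1/3.
strategies = [
    # Uniform sampling on [0, 1) with pdf p(x) = 1.
    (lambda rng: rng.random(), lambda x: 1.0, 20000),
    # Sampling proportional to x via inverse CDF, with pdf p(x) = 2x.
    (lambda rng: math.sqrt(rng.random()), lambda x: 2.0 * x, 20000),
]

estimate = mis_estimate(lambda x: x * x, strategies, random.Random(0))
```

The balance heuristic weights each sample by how likely its own strategy was to generate it relative to all strategies, so no single poorly matched strategy dominates the variance.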