🤖 AI Summary
Existing large reconstruction models (LRMs) achieve strong geometric reconstruction from sparse views but struggle to accurately recover unseen regions and glossy materials, and cannot produce relightable 3D content for standard graphics engines. This paper proposes the first sub-second joint reconstruction framework for sparse-view inputs, simultaneously generating high-fidelity geometry (represented as a hexa-plane neural signed distance field), spatially varying material properties, and view-dependent radiance fields. The method introduces a progressive multi-view update mechanism and neural directional embeddings, integrated within a transformer architecture and trained with a coarse-to-fine strategy on a large-scale shape-and-material dataset. Quantitatively, it matches dense-view optimization methods in geometric and relighting accuracy while accelerating inference by roughly two orders of magnitude (<1 s). It also integrates natively with standard graphics engines and supports real-time relighting, significantly improving its practicality and deployment potential.
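The progressive multi-view update mechanism can be pictured as a recurrent refinement loop: a latent reconstruction state is updated each time a new input view arrives, rather than re-encoding all views from scratch. The sketch below is purely illustrative and assumes a simple gated update over a flat latent vector; the names (`W_enc`, `W_gate`, `W_cand`, `update`) are hypothetical stand-ins for the paper's transformer-based update model.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # latent dimension (illustrative; the real model uses large token sets)

# Hypothetical learned weights; in the actual model these live inside a transformer.
W_enc = rng.normal(scale=0.1, size=(D, D))
W_gate = rng.normal(scale=0.1, size=(2 * D, D))
W_cand = rng.normal(scale=0.1, size=(2 * D, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def update(state, view):
    """Gated update: decide, per channel, how much the new view overwrites the state."""
    h = np.concatenate([state, view])
    g = sigmoid(h @ W_gate)       # 0 = keep the old reconstruction, 1 = take the candidate
    cand = np.tanh(h @ W_cand)    # candidate reconstruction from old state + new view
    return (1.0 - g) * state + g * cand

state = np.zeros(D)               # empty reconstruction before any views arrive
for _ in range(4):                # four sparse input views, added one at a time
    view = rng.normal(size=(D,)) @ W_enc   # stand-in for the image encoder
    state = update(state, view)
print(state.shape)  # (8,)
```

The point of this structure is that adding a fifth view only costs one more update step, which is what lets reconstruction quality improve incrementally as views are appended.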
📝 Abstract
We present the Large Inverse Rendering Model (LIRM), a transformer architecture that jointly reconstructs high-quality shape, materials, and radiance fields with view-dependent effects in less than a second. Our model builds upon recent Large Reconstruction Models (LRMs), which achieve state-of-the-art sparse-view reconstruction quality. However, existing LRMs struggle to reconstruct unseen parts accurately, and they cannot recover glossy appearance or generate relightable 3D content that standard graphics engines can consume. To address these limitations, we make three key technical contributions toward a more practical multi-view 3D reconstruction framework. First, we introduce an update model that progressively incorporates additional input views to improve the reconstruction. Second, we propose a hexa-plane neural SDF representation that better recovers detailed textures, geometry, and material parameters. Third, we develop a novel neural directional-embedding mechanism to handle view-dependent effects. Trained on a large-scale shape and material dataset with a tailored coarse-to-fine training scheme, our model achieves compelling results. It compares favorably to optimization-based dense-view inverse rendering methods in terms of geometry and relighting accuracy, while requiring only a fraction of the inference time.
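To make the hexa-plane idea concrete: like a tri-plane, a hexa-plane stores features on axis-aligned 2D grids, and a 3D point is decoded by projecting it onto each plane, bilinearly sampling, and feeding the gathered features to a small MLP that predicts SDF and material values. The sketch below is a minimal illustration assuming six planes arranged as two per axis pair; the exact layout and decoder in the paper may differ, and `query_features` is a hypothetical name.

```python
import numpy as np

def bilinear_sample(plane, uv):
    """Bilinearly sample a (H, W, C) feature plane at normalized uv in [0, 1]^2."""
    H, W, _ = plane.shape
    x, y = uv[0] * (W - 1), uv[1] * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * plane[y0, x0]
            + wx * (1 - wy) * plane[y0, x1]
            + (1 - wx) * wy * plane[y1, x0]
            + wx * wy * plane[y1, x1])

def query_features(planes, p):
    """Project a 3D point p in [-1, 1]^3 onto each plane and concatenate features.

    `planes` holds six (H, W, C) grids, assumed here to be two per axis pair
    (an illustrative hexa-plane layout, not necessarily the paper's).
    """
    x, y, z = (p + 1.0) / 2.0  # map to [0, 1]
    uvs = [(x, y), (x, y), (x, z), (x, z), (y, z), (y, z)]
    feats = [bilinear_sample(pl, uv) for pl, uv in zip(planes, uvs)]
    return np.concatenate(feats)  # would feed an MLP predicting SDF + materials

planes = [np.random.default_rng(0).normal(size=(16, 16, 4)) for _ in range(6)]
f = query_features(planes, np.array([0.1, -0.3, 0.5]))
print(f.shape)  # (24,)
```

Doubling the plane count over a tri-plane gives the decoder more capacity per spatial direction at modest memory cost, which is one plausible reason such a representation helps with fine textures and material detail.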