🤖 AI Summary
This work addresses the challenge of jointly reconstructing scene illumination, object geometry, and material appearance from only a few posed multi-view images, with the goal of high-fidelity digital twin modeling. To overcome bottlenecks in data efficiency, memory footprint, and generalization to real data in existing approaches, we propose: (1) a memory-efficient voxel-grid transformer whose memory scales only quadratically with the size of the voxel grid; (2) a large, fully synthetic dataset of procedurally generated PBR-textured objects under varied illumination for pretraining; and (3) fine-tuning on real-world datasets through a differentiable physically based rendering (PBR) shading model, which requires no ground-truth illumination or material properties and narrows the synthetic-to-real gap. Evaluated on the real-world StanfordORB benchmark using only 3–5 input views, our method significantly outperforms existing feedforward baselines and is comparable in quality to much slower per-scene optimization methods. Our approach advances few-view, material-aware 3D reconstruction.
📝 Abstract
We present Twinner, the first large reconstruction model capable of recovering a scene's illumination as well as an object's geometry and material properties from only a few posed images. Twinner is based on the Large Reconstruction Model and innovates in three key ways: 1) We introduce a memory-efficient voxel-grid transformer whose memory scales only quadratically with the size of the voxel grid. 2) To deal with the scarcity of high-quality ground-truth PBR-shaded models, we introduce a large, fully synthetic dataset of procedurally generated PBR-textured objects lit with varied illumination. 3) To narrow the synthetic-to-real gap, we fine-tune the model on real-world datasets by means of a differentiable physically based shading model, removing the need for ground-truth illumination or material properties, which are challenging to obtain for real scenes. We demonstrate the efficacy of our model on the real-world StanfordORB benchmark where, given few input views, we achieve reconstruction quality significantly superior to existing feedforward reconstruction networks and comparable to much slower per-scene optimization methods.
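To give a sense of what quadratic memory scaling buys, the following minimal sketch compares growth rates as a voxel grid gets finer. It assumes that "size of the voxel grid" means the side length N of a cubic N×N×N grid; the abstract does not state this, nor how the quadratic scaling is achieved, so the code is plain arithmetic and not a description of Twinner's architecture. The function names are hypothetical.

```python
def scaling_table(sides=(16, 32, 64, 128)):
    """Compare growth rates for a cubic voxel grid of side length N.

    Assumption: "size of the voxel grid" means the side length N.
      quadratic             -> N**2, the scaling claimed in the abstract
      dense_voxels          -> N**3, one entry per voxel
      dense_attention_pairs -> N**6, all-pairs attention over N**3 voxel tokens
    """
    rows = []
    for n in sides:
        rows.append({
            "side": n,
            "quadratic": n ** 2,
            "dense_voxels": n ** 3,
            "dense_attention_pairs": n ** 6,
        })
    return rows


def growth_factor(exponent, scale=2):
    """Factor by which a cost of order N**exponent grows when N is multiplied by scale."""
    return scale ** exponent


if __name__ == "__main__":
    for row in scaling_table():
        print(row)
    # Doubling the side length under each scaling law.
    print("quadratic:", growth_factor(2))
    print("cubic:", growth_factor(3))
    print("dense attention:", growth_factor(6))
```

Under this reading, doubling the grid side length multiplies a quadratic cost by 4, a per-voxel cost by 8, and an all-pairs attention cost over every voxel by 64.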