🤖 AI Summary
This paper tackles single-image 3D object reconstruction in natural scenes, where occlusion and clutter make it difficult to estimate geometry, texture, and spatial layout accurately, by proposing a visually grounded generative reconstruction framework. Methodologically, the authors (1) design a human- and model-in-the-loop annotation pipeline to build a visually grounded real-world 3D dataset at unprecedented scale; (2) adopt a multi-stage training paradigm that combines synthetic-data pretraining with real-data alignment to overcome the scarcity of 3D supervision; and (3) jointly predict object shape, texture, and pose from a single image, leveraging scene context where direct visual evidence is occluded. In human preference evaluations on real-image reconstruction, the method wins over state-of-the-art baselines at a rate of at least 5:1. To foster reproducibility and community adoption, the authors will publicly release the code, pretrained models, an interactive online demo, and a new benchmark dataset.
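To make the headline number concrete, the snippet below is a minimal sketch of how a pairwise win rate such as 5:1 is typically computed from human preference judgments. It is illustrative only, not the authors' evaluation code; the labels, counts, and function name are assumptions.

```python
# Illustrative only: computing a pairwise win rate from human preference votes.
# NOT the authors' evaluation code; the data layout and numbers are made up.
from collections import Counter

def win_rate(judgments):
    """judgments: iterable of 'ours', 'baseline', or 'tie' labels,
    one per pairwise comparison shown to an annotator."""
    counts = Counter(judgments)
    wins, losses = counts["ours"], counts["baseline"]
    if losses == 0:
        return float("inf")  # undefeated in this sample
    return wins / losses     # ratio of wins to losses, ignoring ties

# Example with fabricated counts: 500 wins vs. 100 losses gives a 5:1 win rate.
sample = ["ours"] * 500 + ["baseline"] * 100 + ["tie"] * 50
print(f"win rate = {win_rate(sample):.1f}:1")  # -> win rate = 5.0:1
```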
📝 Abstract
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
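As a rough illustration of the multi-stage recipe the abstract describes (synthetic pretraining followed by real-world alignment), here is a schematic two-stage training loop. It is a sketch under stated assumptions: the toy model, data loaders, loss, learning rates, and step counts are placeholders, not the SAM 3D implementation.

```python
# Schematic sketch of a two-stage recipe: synthetic pretraining, then alignment
# on real, visually grounded annotations. All components below are placeholders
# chosen so the example runs end to end; they are not the authors' code.
import torch
import torch.nn as nn

def run_stage(model, loader, lr, steps):
    """Run one training stage over `loader` for `steps` optimizer updates."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # placeholder stand-in for a 3D reconstruction loss
    for _, (image, target_3d) in zip(range(steps), loader):
        pred = model(image)              # predicted shape/texture/pose code
        loss = loss_fn(pred, target_3d)  # supervision from 3D annotations
        opt.zero_grad()
        loss.backward()
        opt.step()

# Toy stand-ins so the sketch is runnable.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 64))
def fake_loader():
    while True:
        yield torch.randn(8, 3, 32, 32), torch.randn(8, 64)

run_stage(model, fake_loader(), lr=1e-4, steps=10)  # Stage 1: synthetic pretraining
run_stage(model, fake_loader(), lr=1e-5, steps=5)   # Stage 2: real-data alignment
```

The point of the second stage is only that real, human-annotated data is used to align a model already pretrained on abundant synthetic 3D supervision; the specific losses and schedules are not given in the abstract.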