SAM 3D: 3Dfy Anything in Images

📅 2025-11-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the challenges of single-image 3D object reconstruction in natural scenes, specifically inaccurate estimation of geometry, texture, and spatial layout under occlusion and clutter, by proposing a visually grounded generative reconstruction framework. Methodologically, the authors (1) design a human- and model-in-the-loop annotation pipeline to construct a large-scale, visually grounded real-world 3D dataset; (2) adopt a multi-stage training paradigm combining synthetic pretraining with real-data alignment to alleviate the scarcity of 3D supervision; and (3) integrate a context-aware module for joint estimation of shape, pose, and texture. Experiments demonstrate at least a 5:1 win rate over state-of-the-art methods in human preference evaluations on real-image reconstruction. To foster reproducibility and community advancement, the authors will publicly release the code, pretrained models, an interactive online demo, and a new benchmark dataset.

📝 Abstract
We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
Problem

Research questions and friction points this paper is trying to address.

Reconstructing 3D objects from single images
Handling occlusion and clutter in natural scenes
Overcoming limited 3D training data availability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative model for single-image 3D reconstruction
Human- and model-in-the-loop annotation pipeline
Multi-stage training combining synthetic and real data
👥 Authors

SAM 3D Team (Meta Superintelligence Labs)
Xingyu Chen (Meta Superintelligence Labs)
Fu-Jen Chu (Facebook AI Research)
Pierre Gleize (Meta Superintelligence Labs)
Kevin J Liang (Fundamental AI Research (FAIR) at Meta)
Alexander Sax (Meta Superintelligence Labs)
Hao Tang (Meta Superintelligence Labs)
Weiyao Wang (Meta Superintelligence Labs)
Michelle Guo (Stanford University)
Thibaut Hardin (Meta Superintelligence Labs)
Xiang Li (Meta Superintelligence Labs)
Aohan Lin (Meta Superintelligence Labs)
Jiawei Liu (Meta Superintelligence Labs)
Ziqi Ma (Meta Superintelligence Labs)
Anushka Sagar (Meta Superintelligence Labs)
Bowen Song (Meta Superintelligence Labs)
Xiaodong Wang (Meta Superintelligence Labs)
Jianing Yang (Meta Superintelligence Labs)
Bowen Zhang (Meta Superintelligence Labs)
Piotr Dollár (FAIR)
Georgia Gkioxari (Caltech, Meta AI)
Matt Feiszli (Facebook AI Research)
Jitendra Malik (Meta Superintelligence Labs)