Ref-SAM3D: Bridging SAM3D with Text for Reference 3D Reconstruction

📅 2025-11-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
While SAM3D excels at 3D reconstruction from single-view RGB images, it lacks text-driven semantic controllability, limiting its applicability in 3D editing and virtual environment generation. To address this, we propose the first zero-shot, text-guided 3D reconstruction framework tailored for SAM3D. Our method establishes cross-modal alignment between a single RGB image and natural language descriptions by coupling a CLIP text encoder with SAM3D’s 3D decoder—without requiring any text–3D paired training data. It preserves SAM3D’s original single-view reconstruction capability while injecting linguistic priors to precisely steer both the semantic category and the geometric structure of the reconstructed 3D output. Extensive experiments demonstrate that our approach significantly improves semantic fidelity and fine-grained geometric detail recovery under zero-shot settings, effectively narrowing the semantic gap between 2D visual understanding and 3D geometric generation.
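Since the method is zero-shot and needs no text–3D paired training, one plausible reading of the pipeline is: embed the text query and candidate object regions with CLIP, pick the best-matching region, and hand it to SAM3D for reconstruction. The sketch below illustrates only that selection step; it is an assumption for illustration, not the paper's released code, and all names and dimensions in it are hypothetical.

```python
import numpy as np

def select_referred_object(crop_embs, text_emb):
    """Illustrative sketch (assumption, not the paper's code): zero-shot
    selection of the text-referred object via cosine similarity between
    CLIP image embeddings of candidate object crops and the CLIP text
    embedding. The best-scoring crop would then be passed to SAM3D."""
    def l2norm(x):
        # Normalize along the embedding axis so dot products are cosines.
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    sims = l2norm(crop_embs) @ l2norm(text_emb)  # (num_crops,)
    return int(np.argmax(sims)), sims

# Dummy stand-ins for real CLIP embeddings (512-d is CLIP ViT-B/32's size).
crops = np.eye(4, 512)          # 4 fake crop embeddings, mutually orthogonal
text = np.zeros(512)
text[2] = 1.0                   # fake text embedding aligned with crop 2
idx, sims = select_referred_object(crops, text)
print(idx)   # 2
```

In a real system the dummy arrays would come from a frozen CLIP image/text encoder, which is what makes the selection zero-shot: no weights are trained, only pretrained embeddings are compared.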

📝 Abstract
SAM3D has garnered widespread attention for its strong 3D object reconstruction capabilities. However, a key limitation remains: SAM3D cannot reconstruct specific objects referred to by textual descriptions, a capability that is essential for practical applications such as 3D editing, game development, and virtual environments. To address this gap, we introduce Ref-SAM3D, a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. Through extensive qualitative experiments, we show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance. Our results demonstrate that Ref-SAM3D effectively bridges the gap between 2D visual cues and 3D geometric understanding, offering a more flexible and accessible paradigm for reference-guided 3D reconstruction. Code is available at: https://github.com/FudanCVL/Ref-SAM3D.
Problem

Research questions and friction points this paper is trying to address.

Enabling text-guided 3D reconstruction from a single RGB image
Overcoming SAM3D's inability to reconstruct objects referred to by textual descriptions
Providing a flexible, reference-guided 3D reconstruction paradigm for practical applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extends SAM3D with textual descriptions as a high-level prior
Enables text-guided 3D reconstruction from a single image
Bridges 2D visual cues with 3D geometric understanding
Authors

Yun Zhou
Institute of Big Data, College of Computer Science and Artificial Intelligence, Fudan University

Yaoting Wang
Fudan University

Guangquan Jie
Institute of Big Data, College of Computer Science and Artificial Intelligence, Fudan University

Jinyu Liu
Institute of Big Data, College of Computer Science and Artificial Intelligence, Fudan University

Henghui Ding
Fudan University