🤖 AI Summary
While SAM3D excels at 3D reconstruction from single-view RGB images, it lacks text-driven semantic controllability, which limits its applicability to 3D editing and virtual environment generation. To address this, we propose the first zero-shot, text-guided 3D reconstruction framework tailored for SAM3D. Our method establishes cross-modal alignment between a single RGB image and natural language descriptions by coupling a CLIP text encoder with SAM3D’s 3D decoder—without requiring any text–3D paired training data. It preserves SAM3D’s original single-view reconstruction capability while injecting linguistic priors that steer both the semantic category and the geometric structure of the reconstructed 3D output. Extensive experiments demonstrate that our approach significantly improves semantic fidelity and fine-grained geometric detail recovery in zero-shot settings, effectively narrowing the semantic gap between 2D visual understanding and 3D geometric generation.
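The core idea—using a text embedding to pick out which image content the 3D decoder should reconstruct—can be sketched as follows. This is a minimal illustration, not the authors' implementation: `select_region` and the toy vectors are hypothetical stand-ins, and a real system would obtain the embeddings from CLIP's text and image encoders.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_region(text_emb, region_embs):
    """Return the index of the candidate region whose embedding best
    matches the text embedding, plus all similarity scores."""
    scores = [cosine_sim(text_emb, r) for r in region_embs]
    return int(np.argmax(scores)), scores

# Toy embeddings standing in for real CLIP features (hypothetical values).
text_emb = np.array([1.0, 0.0, 0.0])        # e.g. encoding of "the chair"
region_embs = [
    np.array([0.10, 0.90, 0.00]),           # features of a non-matching region
    np.array([0.95, 0.05, 0.00]),           # features of the referred region
]

idx, scores = select_region(text_emb, region_embs)
# The selected region's features would then condition SAM3D's 3D decoder.
```

In this sketch, zero-shot behavior comes entirely from CLIP's pretrained joint embedding space: no text–3D pairs are needed, because language only selects and conditions the 2D evidence that SAM3D already knows how to lift to 3D.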
📝 Abstract
SAM3D has garnered widespread attention for its strong 3D object reconstruction capabilities. However, a key limitation remains: SAM3D cannot reconstruct specific objects referred to by textual descriptions, a capability that is essential for practical applications such as 3D editing, game development, and virtual environments. To address this gap, we introduce Ref-SAM3D, a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. Through extensive qualitative experiments, we show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance. Our results demonstrate that Ref-SAM3D effectively bridges the gap between 2D visual cues and 3D geometric understanding, offering a more flexible and accessible paradigm for reference-guided 3D reconstruction. Code is available at: https://github.com/FudanCVL/Ref-SAM3D.