🤖 AI Summary
While SAM3D excels at 3D reconstruction from single-view RGB images, it lacks text-driven semantic controllability, which limits its applicability to 3D editing and virtual environment generation. To address this, we propose the first zero-shot, text-guided 3D reconstruction framework tailored for SAM3D. Our method establishes cross-modal alignment between a single RGB image and natural language descriptions by coupling a CLIP text encoder with SAM3D’s 3D decoder—without requiring any text–3D paired training data. It preserves SAM3D’s original single-view reconstruction capability while injecting linguistic priors that steer both the semantic category and the geometric structure of the reconstructed 3D output. Extensive experiments demonstrate that our approach significantly improves semantic fidelity and fine-grained geometric detail recovery in zero-shot settings, effectively narrowing the semantic gap between 2D visual understanding and 3D geometric generation.
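The core idea—using a text embedding to pick out which image content the 3D decoder should reconstruct—can be sketched as follows. This is a minimal illustration, not the authors' implementation: `select_region` and the toy vectors are hypothetical stand-ins, and a real system would obtain the embeddings from CLIP's text and image encoders.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_region(text_emb, region_embs):
    """Return the index of the candidate region whose embedding best
    matches the text embedding, plus all similarity scores."""
    scores = [cosine_sim(text_emb, r) for r in region_embs]
    return int(np.argmax(scores)), scores

# Toy embeddings standing in for real CLIP features (hypothetical values).
text_emb = np.array([1.0, 0.0, 0.0])        # e.g. encoding of "the chair"
region_embs = [
    np.array([0.10, 0.90, 0.00]),           # features of a non-matching region
    np.array([0.95, 0.05, 0.00]),           # features of the referred region
]

idx, scores = select_region(text_emb, region_embs)
# The selected region's features would then condition SAM3D's 3D decoder.
```

In this sketch, zero-shot behavior comes entirely from CLIP's pretrained joint embedding space: no text–3D pairs are needed, because language only selects and conditions the 2D evidence that SAM3D already knows how to lift to 3D.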
📝 Abstract
SAM3D has garnered widespread attention for its strong 3D object reconstruction capabilities. However, a key limitation remains: SAM3D cannot reconstruct specific objects referred to by textual descriptions, a capability that is essential for practical applications such as 3D editing, game development, and virtual environments. To address this gap, we introduce Ref-SAM3D, a simple yet effective extension to SAM3D that incorporates textual descriptions as a high-level prior, enabling text-guided 3D reconstruction from a single RGB image. Through extensive qualitative experiments, we show that Ref-SAM3D, guided only by natural language and a single 2D view, delivers competitive and high-fidelity zero-shot reconstruction performance. Our results demonstrate that Ref-SAM3D effectively bridges the gap between 2D visual cues and 3D geometric understanding, offering a more flexible and accessible paradigm for reference-guided 3D reconstruction. Code is available at: https://github.com/FudanCVL/Ref-SAM3D.