AI Summary
This work addresses the research gap in multimodal VR sketch-driven 3D generation by proposing the first native VR hand-drawn sketch-guided multimodal 3D Gaussian Splatting (3DGS) generation framework. Methodologically: (1) it introduces a two-stage Sketch-CLIP feature alignment mechanism to enable cross-modal semantic coordination between VR sketches and text prompts; (2) it decouples geometric control (from sketches) from appearance control (from text), enhancing generation controllability; and (3) it constructs VRSS, the first large-scale, four-modal paired dataset comprising VR sketches, text descriptions, rendered images, and corresponding 3DGS representations. Experiments on VRSS demonstrate significant improvements in geometric fidelity and texture quality, yielding high-fidelity, renderable 3DGS models. Moreover, the framework achieves superior inference efficiency compared to existing sketch-to-3D approaches.
Abstract
We propose VRSketch2Gaussian, the first VR sketch-guided, multi-modal, native 3D object generation framework that incorporates a 3D Gaussian Splatting representation. As part of our work, we introduce VRSS, the first large-scale paired dataset containing VR sketches, text, images, and 3DGS, bridging the gap in multi-modal VR sketch-based generation. Our approach features the following key innovations: 1) Sketch-CLIP feature alignment. We propose a two-stage alignment strategy that bridges the domain gap between sparse VR sketch embeddings and rich CLIP embeddings, facilitating both VR sketch-based retrieval and generation tasks. 2) Fine-grained multi-modal conditioning. We disentangle the 3D generation process by using explicit VR sketches for geometric conditioning and text descriptions for appearance control. To facilitate this, we propose a generalizable VR sketch encoder that effectively aligns different modalities. 3) Efficient and high-fidelity 3D-native generation. Our method leverages a 3D-native generation approach that enables fast and texture-rich 3D object synthesis. Experiments conducted on our VRSS dataset demonstrate that our method achieves high-quality, multi-modal VR sketch-based 3D generation. We believe our VRSS dataset and VRSketch2Gaussian method will be beneficial for the 3D generation community.
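To illustrate why aligning sketch embeddings with CLIP embeddings helps retrieval, here is a minimal sketch of nearest-neighbor lookup in a shared embedding space. It is not the paper's implementation: the abstract does not describe the encoder or the two-stage alignment in detail, so the function names (`l2_normalize`, `cosine_similarity`, `retrieve`), the toy 3-dimensional vectors, and the use of cosine similarity as the ranking score are all assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve(sketch_embedding, gallery, top_k=1):
    """Rank gallery items by similarity to an (assumed already aligned) sketch embedding.

    `gallery` maps an item name to its embedding in the shared space.
    Returns a list of (name, score) pairs, best match first.
    """
    scored = [
        (name, cosine_similarity(sketch_embedding, emb))
        for name, emb in gallery.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data standing in for CLIP-space embeddings of 3D assets (made up for this example).
gallery = {
    "chair": [0.9, 0.1, 0.0],
    "table": [0.1, 0.9, 0.0],
    "lamp": [0.0, 0.1, 0.9],
}
# Toy embedding standing in for an aligned VR sketch of a chair.
query = [0.8, 0.2, 0.1]

best_name, best_score = retrieve(query, gallery, top_k=1)[0]
print(best_name)
```

Once sketch and CLIP embeddings live in the same space, a simple similarity ranking like this is enough for retrieval. The same aligned embedding could also serve as a conditioning signal for a generator, which is the role the abstract assigns to it.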