SplatFlow: Multi-View Rectified Flow Model for 3D Gaussian Splatting Synthesis

📅 2024-11-25
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing 3D scene generation and editing methods lack a unified framework: they cannot support end-to-end, text-driven 3D Gaussian Splatting (3DGS) generation and editing simultaneously, and they degrade on diverse scene scales and complex camera trajectories. To address this, the paper proposes SplatFlow, an integrated, text-driven framework for 3DGS generation and editing. The method combines a multi-view rectified flow model that jointly predicts RGB images, depth maps, and camera poses in latent space; a feed-forward Gaussian Splatting Decoder (GSDecoder) that translates these latents directly into 3DGS; and training-free inversion and inpainting techniques that enable zero-shot 3D editing. Experiments on MVImgNet and DL3DV-7K demonstrate high-fidelity novel-view synthesis, precise object-level editing, and accurate camera pose estimation within a single pipeline.

📝 Abstract
Text-based generation and editing of 3D scenes hold significant potential for streamlining content creation through intuitive user interactions. While recent advances leverage 3D Gaussian Splatting (3DGS) for high-fidelity and real-time rendering, existing methods are often specialized and task-focused, lacking a unified framework for both generation and editing. In this paper, we introduce SplatFlow, a comprehensive framework that addresses this gap by enabling direct 3DGS generation and editing. SplatFlow comprises two main components: a multi-view rectified flow (RF) model and a Gaussian Splatting Decoder (GSDecoder). The multi-view RF model operates in latent space, generating multi-view images, depths, and camera poses simultaneously, conditioned on text prompts, thus addressing challenges like diverse scene scales and complex camera trajectories in real-world settings. Then, the GSDecoder efficiently translates these latent outputs into 3DGS representations through a feed-forward 3DGS method. Leveraging training-free inversion and inpainting techniques, SplatFlow enables seamless 3DGS editing and supports a broad range of 3D tasks, including object editing, novel view synthesis, and camera pose estimation, within a unified framework without requiring additional complex pipelines. We validate SplatFlow's capabilities on the MVImgNet and DL3DV-7K datasets, demonstrating its versatility and effectiveness in various 3D generation, editing, and inpainting-based tasks.
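At a high level, a rectified flow model learns a velocity field whose ODE transports noise to data along near-straight paths, so sampling is simply numerical integration of that ODE. The sketch below is illustrative only: SplatFlow's actual model is a learned, text-conditioned multi-view network, whereas `velocity` here is a closed-form stand-in that assumes the endpoint is known.

```python
import numpy as np

# Toy stand-in for a learned RF velocity field. On the straight path
# x_t = (1 - t) * x0 + t * x1, the true velocity is x1 - x0, which equals
# (x1 - x_t) / (1 - t); in practice a network predicts this from (x_t, t).
def velocity(x, t, endpoint):
    return (endpoint - x) / max(1.0 - t, 1e-6)

def rf_sample(x_noise, endpoint, num_steps=50):
    """Euler-integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (data)."""
    x = x_noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt, endpoint)
    return x

rng = np.random.default_rng(0)
noise = rng.standard_normal(4)
data = np.array([1.0, -2.0, 0.5, 3.0])
sample = rf_sample(noise, data)  # converges to `data` as steps increase
```

With the exact straight-path field, the Euler trajectory lands on the endpoint; a learned field only approximates this, which is why step count matters in practice.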
Problem

Research questions and friction points this paper is trying to address.

No unified framework for 3D Gaussian Splatting (3DGS) generation and editing
Degraded performance on diverse scene scales and complex camera trajectories
Difficulty supporting seamless 3DGS editing and multiple 3D tasks without extra pipelines
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-view rectified flow (RF) model jointly generates images, depths, and camera poses in latent space
Feed-forward GSDecoder translates latent outputs into 3DGS representations
Training-free inversion and inpainting enable seamless 3D editing
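The training-free editing idea can be illustrated with a RePaint-style blending loop. This is a sketch under simplifying assumptions, not SplatFlow's exact inversion/inpainting procedure: during forward sampling, the known region is re-injected on its own noise-to-data path at every step, so only the masked region is regenerated.

```python
import numpy as np

# Mask-based inpainting sketch with a toy rectified-flow field.
# `velocity` is a closed-form stand-in for a learned network; `mask`
# marks entries to regenerate toward `edit_target`, while unmasked
# entries are kept on the straight path of the original `known` data.
def velocity(x, t, endpoint):
    return (endpoint - x) / max(1.0 - t, 1e-6)

def inpaint_sample(known, mask, edit_target, num_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(known.shape)
    x = noise.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        x = x + dt * velocity(x, i * dt, edit_target)
        t_next = (i + 1) * dt
        # Re-inject the known region on its own noise -> data path.
        x[~mask] = (1.0 - t_next) * noise[~mask] + t_next * known[~mask]
    return x

known = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([False, False, True, True])   # regenerate last two entries
edited = inpaint_sample(known, mask, np.array([0.0, 0.0, -1.0, -1.0]))
```

Because no gradients or fine-tuning are involved, the same pretrained flow model serves both generation and editing, which is the appeal of the training-free approach.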