VideoRFSplat: Direct Scene-Level Text-to-3D Gaussian Splatting Generation with Flexible Pose and Multi-View Joint Modeling

📅 2025-03-20
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses end-to-end, text-driven 3D Gaussian Splatting (3DGS) generation for unbounded real-world scenes, where existing methods suffer from training instability and poor generalization due to joint modeling of camera poses and multi-view images. The authors propose a dual-stream architecture that couples a pre-trained video diffusion model with a dedicated pose generator, augmented by an asynchronous denoising sampling strategy that mitigates cross-modal interference and pose-appearance ambiguity. Communication blocks bridge the two streams, and joint fine-tuning across multiple datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID) is performed without post-hoc techniques such as score distillation. Experiments demonstrate state-of-the-art performance in both 3D reconstruction quality and camera pose plausibility, improving generalization and geometric fidelity for text-to-3DGS synthesis.

πŸ“ Abstract
We propose VideoRFSplat, a direct text-to-3D model leveraging a video generation model to generate realistic 3D Gaussian Splatting (3DGS) for unbounded real-world scenes. To generate diverse camera poses and unbounded spatial extent of real-world scenes, while ensuring generalization to arbitrary text prompts, previous methods fine-tune 2D generative models to jointly model camera poses and multi-view images. However, these methods suffer from instability when extending 2D generative models to joint modeling due to the modality gap, which necessitates additional models to stabilize training and inference. In this work, we propose an architecture and a sampling strategy to jointly model multi-view images and camera poses when fine-tuning a video generation model. Our core idea is a dual-stream architecture that attaches a dedicated pose generation model alongside a pre-trained video generation model via communication blocks, generating multi-view images and camera poses through separate streams. This design reduces interference between the pose and image modalities. Additionally, we propose an asynchronous sampling strategy that denoises camera poses faster than multi-view images, allowing rapidly denoised poses to condition multi-view generation, reducing mutual ambiguity and enhancing cross-modal consistency. Trained on multiple large-scale real-world datasets (RealEstate10K, MVImgNet, DL3DV-10K, ACID), VideoRFSplat outperforms existing text-to-3D direct generation methods that heavily depend on post-hoc refinement via score distillation sampling, achieving superior results without such refinement.
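The asynchronous sampling idea in the abstract can be illustrated with a toy timestep schedule: the pose stream follows a faster noise-reduction schedule than the image stream, so nearly clean poses are available to condition multi-view image denoising at every step. This is a minimal sketch of that scheduling idea only; the function name, the linear schedule, and the `speedup` parameter are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def async_timesteps(num_steps, speedup=2.0):
    """Return (image_t, pose_t) noise-level pairs for each sampling step.

    The pose noise level decreases `speedup` times faster than the image
    noise level, reaching 0 (fully denoised) early. Linear schedules are
    an assumption for illustration; real samplers use richer schedules.
    """
    # Image noise level goes linearly from 1 (pure noise) to 0 (clean).
    image_t = np.linspace(1.0, 0.0, num_steps)
    # Pose noise level runs ahead of the image schedule and is clipped
    # at 0 once the poses are fully denoised.
    pose_t = np.clip(1.0 - speedup * (1.0 - image_t), 0.0, 1.0)
    return list(zip(image_t, pose_t))

schedule = async_timesteps(num_steps=5, speedup=2.0)
# At every step the pose noise level is at or below the image's, so the
# poses conditioning the image stream are always the "cleaner" modality.
assert all(p <= i for i, p in schedule)
```

In an actual sampler loop, each step would denoise the poses at `pose_t`, then feed the partially (or fully) denoised poses through the communication blocks as conditioning while denoising the images at `image_t`, which is how the paper reduces mutual pose-appearance ambiguity.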
Problem

Research questions and friction points this paper is trying to address.

How to generate realistic 3D Gaussian Splatting directly from text prompts for unbounded real-world scenes.
How to jointly model multi-view images and camera poses without training instability.
How to achieve cross-modal consistency without post-hoc refinement such as score distillation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-stream architecture for joint multi-view image and pose generation
Asynchronous sampling strategy that denoises camera poses ahead of images
Direct text-to-3D generation without post-hoc refinement