FlashWorld: High-quality 3D Scene Generation within Seconds

📅 2025-10-15
📈 Citations: 0 · Influential: 0
🤖 AI Summary
Existing 3D scene generation methods suffer from slow inference or low rendering quality. To address these limitations, FlashWorld is a 3D-oriented diffusion framework that directly produces 3D Gaussian representations during multi-view generation, combining dual-mode (MV-oriented and 3D-oriented) pre-training with cross-mode distillation to jointly optimize multi-view visual fidelity and 3D geometric consistency. The method builds on a video diffusion prior and a dual-mode multi-view diffusion architecture, and leverages large-scale single-view image and text-prompt data to improve generalization. Experiments demonstrate significantly improved visual quality with high 3D consistency, second-level inference that is 10–100× faster than state-of-the-art methods, and flexible input modalities (single image or text), enabling high-fidelity, efficient 3D scene generation across diverse scenes.

📝 Abstract
We propose FlashWorld, a generative model that produces 3D scenes from a single image or text prompt in seconds, 10~100$\times$ faster than previous works while achieving superior rendering quality. Our approach shifts from the conventional multi-view-oriented (MV-oriented) paradigm, which generates multi-view images for subsequent 3D reconstruction, to a 3D-oriented approach where the model directly produces 3D Gaussian representations during multi-view generation. While ensuring 3D consistency, 3D-oriented methods typically suffer from poor visual quality. FlashWorld includes a dual-mode pre-training phase followed by a cross-mode post-training phase, effectively integrating the strengths of both paradigms. Specifically, leveraging the prior from a video diffusion model, we first pre-train a dual-mode multi-view diffusion model, which jointly supports MV-oriented and 3D-oriented generation modes. To bridge the quality gap in 3D-oriented generation, we further propose cross-mode post-training distillation that matches the output distribution of the consistent 3D-oriented mode to that of the high-quality MV-oriented mode. This not only enhances visual quality while maintaining 3D consistency, but also reduces the number of denoising steps required at inference. We also propose a strategy that leverages massive single-view images and text prompts during this process to enhance the model's generalization to out-of-distribution inputs. Extensive experiments demonstrate the superiority and efficiency of our method.
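To make the two-mode idea concrete, here is a minimal, self-contained sketch of one cross-mode distillation step in a PyTorch-style setup. `DualModeNet`, the rendering stub, the 14-channel Gaussian head, and the use of a simple MSE matching loss are all illustrative assumptions, not the paper's actual architecture or objective (the paper matches distributions, not pixels).

```python
import torch
import torch.nn as nn

class DualModeNet(nn.Module):
    """Stand-in for the dual-mode multi-view diffusion model (assumption)."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 3, kernel_size=3, padding=1)
        # 14 channels per pixel: position (3) + scale (3) + rotation (4)
        # + opacity (1) + color (3), the usual 3DGS parameterization.
        self.gaussian_head = nn.Conv2d(3, 14, kernel_size=3, padding=1)

    def forward(self, x, mode):
        h = self.backbone(x)
        if mode == "mv":                  # MV-oriented: denoised view images
            return h
        return self.gaussian_head(h)      # 3D-oriented: pixel-aligned Gaussians

def render_stub(gaussians):
    """Placeholder for differentiable Gaussian splatting: takes the color
    channels so the example stays runnable end to end (assumption)."""
    return torch.sigmoid(gaussians[:, -3:])

student = DualModeNet()                   # 3D-oriented mode being improved
teacher = DualModeNet().eval()            # frozen MV-oriented target
for p in teacher.parameters():
    p.requires_grad_(False)
opt = torch.optim.Adam(student.parameters(), lr=1e-4)

views = torch.randn(4, 3, 64, 64)         # a noised batch of multi-view frames
rendered = render_stub(student(views, mode="3d"))
with torch.no_grad():
    target = torch.sigmoid(teacher(views, mode="mv"))

# Pull the 3D-oriented renders toward the MV-oriented outputs; 3D consistency
# is preserved because every view is rendered from one shared Gaussian set.
loss = nn.functional.mse_loss(rendered, target)
opt.zero_grad(); loss.backward(); opt.step()
```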
Problem

Research questions and friction points this paper is trying to address.

Generating 3D scenes from a single image or text prompt in seconds
Ensuring 3D consistency while maintaining high visual quality
Generalizing to out-of-distribution inputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Direct 3D Gaussian generation from a single input (see the sketch after this list)
Dual-mode pre-training with video diffusion prior
Cross-mode distillation for quality and consistency
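The first bullet above, direct 3D Gaussian generation, typically means the network emits pixel-aligned Gaussian parameters alongside the denoised views. Below is a minimal sketch of one plausible per-pixel parameterization under the common 3D Gaussian Splatting convention (3 position + 3 scale + 4 rotation quaternion + 1 opacity + 3 color = 14 channels); the channel layout, activations, and names are assumptions for illustration, not the paper's actual design.

```python
import torch

def split_gaussian_params(feat: torch.Tensor):
    """Split a (B, 14, H, W) feature map into per-pixel Gaussian attributes
    (hypothetical layout): position, scale, rotation, opacity, color."""
    pos, scale, rot, opacity, rgb = torch.split(feat, [3, 3, 4, 1, 3], dim=1)
    return {
        "position": pos,                                        # offsets along camera rays
        "scale": torch.exp(scale),                              # keep scales positive
        "rotation": torch.nn.functional.normalize(rot, dim=1),  # unit quaternions
        "opacity": torch.sigmoid(opacity),                      # in (0, 1)
        "color": torch.sigmoid(rgb),                            # in (0, 1)
    }

feat = torch.randn(2, 14, 64, 64)          # e.g. the output of a Gaussian head
params = split_gaussian_params(feat)
print({k: tuple(v.shape) for k, v in params.items()})
```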
👥 Authors

Xinyang Li
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University
Tengfei Wang
Tencent
Zixiao Gu
Yes Lab, Fudan University
Shengchuan Zhang
Xiamen University
computer vision · machine learning
Chunchao Guo
Tencent
Liujuan Cao
Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University