Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data

📅 2025-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Addressing the challenge that text-to-3D generation is hindered by scarce high-quality 3D supervision, this paper proposes a progressive multi-view diffusion distillation framework that requires no 3D ground truth. Methodologically, it adapts Stable Diffusion natively into a 3D generator: multi-view score distillation, integrating techniques from MVDream and RichDreamer, is performed in latent space via progressive denoising, coupled with real-time triplane decoding. The paper further designs TriplaneTurbo, a lightweight architecture that adds only 2.5% additional parameters to turn Stable Diffusion into an efficient triplane generator. Experiments demonstrate that the method generates high-fidelity 3D meshes in just 1.2 seconds, surpassing state-of-the-art approaches trained without 3D supervision in both efficiency and quality, and generalizes well to creative, compositional text prompts.

📝 Abstract
It is highly desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (e.g., Triplane), they often suffer from poor quality due to the lack of sufficient high-quality 3D training data. To overcome this data shortage, we propose a novel training scheme, termed Progressive Rendering Distillation (PRD), which eliminates the need for 3D ground truths by distilling multi-view diffusion models and adapting SD into a native 3D generator. In each training iteration, PRD uses the U-Net to progressively denoise the latent from random noise for a few steps, and at each step it decodes the denoised latent into a 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used jointly with SD to distill text-consistent textures and geometries into the 3D outputs through score distillation. Since PRD supports training without 3D ground truths, we can easily scale up the training data and improve generation quality for challenging text prompts with creative concepts. Meanwhile, PRD reduces inference to just a few denoising steps, accelerating generation. With PRD, we train a Triplane generator, namely TriplaneTurbo, which adds only 2.5% trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators in both efficiency and quality. Specifically, it can produce high-quality 3D meshes in 1.2 seconds and generalizes well to challenging text inputs. The code is available at https://github.com/theEricMa/TriplaneTurbo.
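The abstract describes each PRD training iteration as a loop: denoise the latent a few steps with the U-Net, decode every intermediate latent into a 3D output, and supervise multi-view renderings of that output via score distillation. The sketch below illustrates only that control flow; every function (`unet_denoise`, `triplane_decode`, `render_views`, `score_distillation_grad`) is a hypothetical toy stub on small numpy arrays, not the paper's actual networks or losses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the pre-trained networks. The real method uses the
# Stable Diffusion U-Net plus MVDream/RichDreamer as distillation teachers.
def unet_denoise(latent, step):
    """One denoising step (toy: shrink the latent toward zero)."""
    return latent * 0.5

def triplane_decode(latent):
    """Decode a latent into a '3D output' (toy: reshape into a small grid)."""
    return latent.reshape(4, -1)

def render_views(triplane, n_views=4):
    """Render multi-view images from the triplane (toy: scaled copies)."""
    return [triplane * (v + 1) for v in range(n_views)]

def score_distillation_grad(view):
    """Teacher score-distillation gradient for one view (toy: mean-centering)."""
    return view - view.mean()

def prd_iteration(latent, n_steps=3):
    """One PRD training iteration: progressively denoise, decode a 3D output
    at every step, and collect distillation gradients from multi-view
    renderings of each intermediate output."""
    grads = []
    for step in range(n_steps):
        latent = unet_denoise(latent, step)      # progressive denoising
        triplane = triplane_decode(latent)       # real-time 3D decoding
        for view in render_views(triplane):      # multi-view rendering
            grads.append(score_distillation_grad(view))
    return latent, grads

latent0 = rng.standard_normal(16)
final_latent, grads = prd_iteration(latent0)
print(len(grads))  # 3 steps x 4 views = 12 distillation gradients
```

Because every intermediate latent is decoded and supervised, the generator learns to produce usable 3D outputs after only a few denoising steps, which is what enables the few-step (1.2 s) inference the paper reports.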
Problem

Research questions and friction points this paper is trying to address.

Generate 3D meshes from text prompts quickly
Overcome lack of high-quality 3D training data
Improve efficiency and quality of text-to-3D generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Progressive Rendering Distillation for 3D generation
Distills multi-view diffusion models without 3D data
Adapts Stable Diffusion into native 3D generator
Zhiyuan Ma
The Hong Kong Polytechnic University, Center for Artificial Intelligence and Robotics, HKISI CAS
Xinyue Liang
PhD student, KTH Royal Institute of Technology
Machine learning · Distributed learning · Neural networks
Rongyuan Wu
The Hong Kong Polytechnic University
Computational Photography · Generative Models
Xiangyu Zhu
State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, School of Artificial Intelligence, University of Chinese Academy of Sciences, UCAS
Zhen Lei
Lei Zhang
The Hong Kong Polytechnic University