TeRA: Rethinking Text-driven Realistic 3D Avatar Generation

📅 2025-09-02
🤖 AI Summary
To address slow iterative optimization, difficulty with localized customization, and limited realism in text-driven 3D avatar generation, this paper proposes TeRA, a text-to-3D-avatar framework trained in two stages. First, a decoder is distilled from a large human reconstruction model, yielding a structured latent space that encodes geometric priors. Second, a text-conditioned latent diffusion model is trained within this space, so an avatar is generated by sampling from the model rather than by running a separate optimization for each prompt. The key contribution is the structured 3D human representation, which supports fine-grained, text-guided local editing (e.g., “wearing a red jacket” or “wearing glasses”) while avoiding the inefficiency of conventional score distillation sampling (SDS)-based optimization. The reported experiments show TeRA outperforming previous text-to-avatar methods in generation quality, photorealism, and inference speed, in both objective and subjective evaluations.

📝 Abstract
In this paper, we rethink text-to-avatar generative models by proposing TeRA, a more efficient and effective framework than the previous SDS-based models and general large 3D generative models. Our approach employs a two-stage training strategy for learning a native 3D avatar generative model. Initially, we distill a decoder to derive a structured latent space from a large human reconstruction model. Subsequently, a text-controlled latent diffusion model is trained to generate photorealistic 3D human avatars within this latent space. TeRA enhances the model performance by eliminating slow iterative optimization and enables text-based partial customization through a structured 3D human representation. Experiments have proven our approach's superiority over previous text-to-avatar generative models in subjective and objective evaluation.
Problem

Research questions and friction points this paper is trying to address.

Generating realistic 3D human avatars from text descriptions
Overcoming inefficiencies in previous SDS-based 3D generation models
Enabling text-based partial customization of 3D human representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training strategy for 3D avatar generation
Latent diffusion model in structured space
Eliminates slow iterative optimization process
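
The inference flow implied by the points above (embed the text, sample a structured latent with a diffusion model, decode it once) can be sketched in a few lines. This is only a toy illustration under stated assumptions, not TeRA's implementation: the latent shape, the step count, and the functions embed_text, predict_noise, and decode_avatar are hypothetical stand-ins for the trained text encoder, the diffusion network, and the distilled decoder. The sampler is a standard DDPM-style ancestral sampler, chosen here for simplicity; the paper may use a different one.

```python
import numpy as np

# Hypothetical sizes: 16 structured latent tokens with 8 channels each.
LATENT_SHAPE = (16, 8)
NUM_STEPS = 50  # assumed number of denoising steps


def embed_text(prompt: str, dim: int = 8) -> np.ndarray:
    """Toy deterministic text embedding (stand-in for a real text encoder)."""
    seed = sum(ord(c) for c in prompt)
    return np.random.default_rng(seed).standard_normal(dim)


def predict_noise(z: np.ndarray, t: int, cond: np.ndarray) -> np.ndarray:
    """Toy noise predictor (stand-in for the text-conditioned diffusion network)."""
    return 0.1 * z - 0.01 * cond  # cond broadcasts over the token axis


def sample_latent(prompt: str, rng: np.random.Generator) -> np.ndarray:
    """Stage 2 at inference: DDPM-style ancestral sampling in the latent space."""
    betas = np.linspace(1e-4, 0.02, NUM_STEPS)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    cond = embed_text(prompt)
    z = rng.standard_normal(LATENT_SHAPE)  # start from pure Gaussian noise
    for t in reversed(range(NUM_STEPS)):
        eps = predict_noise(z, t, cond)
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(LATENT_SHAPE) if t > 0 else 0.0
        z = mean + np.sqrt(betas[t]) * noise
    return z


def decode_avatar(z: np.ndarray) -> np.ndarray:
    """Stage 1 product: toy decoder mapping each latent token to 6 values
    (imagined here as a 3D position plus an RGB colour)."""
    w = np.random.default_rng(0).standard_normal((LATENT_SHAPE[1], 6))
    return np.tanh(z @ w)


def generate(prompt: str, seed: int = 0) -> np.ndarray:
    """Text -> latent -> avatar, with a fixed number of network evaluations."""
    rng = np.random.default_rng(seed)
    return decode_avatar(sample_latent(prompt, rng))


avatar = generate("a person wearing a red jacket")
print(avatar.shape)
```

The contrast with SDS-based pipelines is that the cost here is a fixed number of denoising steps plus one decoder pass, whereas SDS optimizes a fresh 3D representation for each prompt over many gradient updates.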