🤖 AI Summary
Existing text- and image-driven methods for 3D avatar generation struggle with fine-grained control, slow inference, and the scarcity of high-quality 3D training data. To address these limitations, this work proposes PromptAvatar, a framework trained on a newly constructed, large-scale dataset of over 100,000 aligned multimodal samples. The approach uses a dual diffusion architecture that decouples texture and geometry into a Texture Diffusion Model (TDM) and a Geometry Diffusion Model (GDM), enabling rapid generation of high-fidelity 3D avatars from text or image inputs. PromptAvatar requires no iterative optimization, supports flexible multimodal conditioning, and produces shading-free results that avoid baked-in illumination artifacts. Extensive experiments demonstrate that the method substantially outperforms existing approaches in generation quality, fine-grained detail alignment, and computational efficiency, producing highly detailed 3D avatars in under 10 seconds.
📝 Abstract
Generating high-fidelity 3D avatars from text or image prompts is highly sought after in virtual reality and human-computer interaction. However, existing text-driven methods often rely on iterative Score Distillation Sampling (SDS) or CLIP optimization, which struggle with fine-grained semantic control and suffer from excessively slow inference. Meanwhile, image-driven approaches are severely bottlenecked by the scarcity and high acquisition cost of high-quality 3D facial scans, limiting model generalization. To address these challenges, we first construct a novel, large-scale dataset comprising over 100,000 pairs across four modalities: fine-grained textual descriptions, in-the-wild face images, high-quality light-normalized texture UV maps, and 3D geometric shapes. Leveraging this comprehensive dataset, we propose PromptAvatar, a framework featuring dual diffusion models. Specifically, it integrates a Texture Diffusion Model (TDM) that supports flexible multi-condition guidance from text and/or image prompts, alongside a Geometry Diffusion Model (GDM) guided by text prompts. By learning the direct mapping from multi-modal prompts to 3D representations, PromptAvatar eliminates the need for time-consuming iterative optimization, successfully generating high-fidelity, shading-free 3D avatars in under 10 seconds. Extensive quantitative and qualitative experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in generation quality, fine-grained detail alignment, and computational efficiency.
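To make the conditioning scheme described above concrete, the following is a minimal Python sketch of how one dataset sample and one user prompt might be represented, and which conditions each diffusion model would receive. It is an illustration only: the class names, field names, and the `route_conditions` helper are hypothetical and do not come from the paper. The only facts taken from the abstract are the four modalities, that the TDM accepts text and/or image prompts, and that the GDM is guided by text prompts.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AvatarSample:
    """One dataset entry with the four modalities named in the abstract.

    Field names and the use of file paths are assumptions for illustration.
    """
    description: str       # fine-grained textual description
    face_image_path: str   # in-the-wild face image
    texture_uv_path: str   # light-normalized texture UV map
    geometry_path: str     # 3D geometric shape


@dataclass
class Prompt:
    """A user prompt; either field may be omitted (hypothetical structure)."""
    text: Optional[str] = None
    image_path: Optional[str] = None


def route_conditions(prompt: Prompt) -> Dict[str, List[str]]:
    """Return which prompt modalities condition each model.

    TDM: text and/or image. GDM: text only. An empty GDM list means the
    prompt contains no text; the abstract does not say how geometry is
    obtained in that case, so this sketch leaves it unresolved.
    """
    if prompt.text is None and prompt.image_path is None:
        raise ValueError("at least one of text or image is required")
    tdm = []
    if prompt.text is not None:
        tdm.append("text")
    if prompt.image_path is not None:
        tdm.append("image")
    gdm = ["text"] if prompt.text is not None else []
    return {"TDM": tdm, "GDM": gdm}
```

For example, `route_conditions(Prompt(text="a young woman with freckles", image_path="face.png"))` sends both conditions to the TDM and only the text to the GDM.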