🤖 AI Summary
Existing text-to-3D avatar generation methods face two key challenges: (1) insufficient geometric and appearance constraints from text prompts, leading to structural and textural distortions; and (2) misalignment between 2D diffusion models and parametric head models (e.g., FLAME), causing animation artifacts. This work proposes AnimPortrait3D, a framework that integrates pretrained text-to-3D priors with ControlNet-guided semantic alignment, jointly supervised by normal maps and semantic segmentation maps to ensure consistency across geometry, texture, and dynamic facial expressions. The method combines 3D Gaussian splatting, text-to-3D diffusion priors, ControlNet-based conditional control, and the FLAME morphable model. Experiments demonstrate significant improvements over state-of-the-art methods in synthesis quality, 3D model alignment accuracy, and animation fidelity, enabling high-fidelity texture reconstruction and natural, expression-driven animation.
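To make the joint supervision concrete, below is a minimal PyTorch sketch of how a score-distillation-style objective can be conditioned on normal and semantic maps rendered from the morphable model. The `denoiser` module, its signature, and the timestep weighting are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of score distillation sampling (SDS) with ControlNet-style
# conditioning on normal and semantic maps. Illustrates the general
# technique only; `denoiser` is a hypothetical eps-prediction UNet.
import torch

def sds_loss(x0, denoiser, text_emb, normal_map, semantic_map,
             alphas_cumprod, t):
    """SDS on a rendered avatar image x0 of shape (B, 3, H, W).

    `denoiser` takes the noisy image, timestep, text embedding, and the
    ControlNet conditions (normal + semantic maps from the FLAME mesh).
    """
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * noise   # forward diffusion
    with torch.no_grad():                              # no grads through UNet
        eps_hat = denoiser(x_t, t, text_emb, normal_map, semantic_map)
    w = 1 - a_t                                        # common SDS weighting
    grad = w * (eps_hat - noise)                       # SDS gradient
    # Reparameterize so that loss.backward() injects `grad` into x0.
    return (grad.detach() * x0).sum()
```

Backpropagating this loss pushes the rendered avatar toward images the conditioned diffusion model considers likely, while the normal and semantic conditions keep those targets aligned with the parametric head model.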
📝 Abstract
The generation of high-quality, animatable 3D head avatars from text has enormous potential in content creation applications such as games, movies, and embodied virtual assistants. Current text-to-3D generation methods typically combine parametric head models with 2D diffusion models using score distillation sampling to produce 3D-consistent results. However, they struggle to synthesize realistic details and suffer from misalignments between the appearance and the driving parametric model, resulting in unnatural animation. We find that these limitations stem from ambiguities in the 2D diffusion predictions during 3D avatar distillation: i) the avatar's appearance and geometry are underconstrained by the text input, and ii) the semantic alignment between the predictions and the parametric head model is insufficient because the diffusion model alone cannot incorporate information from the parametric model. In this work, we propose AnimPortrait3D, a novel framework for text-based generation of realistic, animatable 3D Gaussian splatting (3DGS) avatars with morphable model alignment, and introduce two key strategies to address these challenges. First, we tackle appearance and geometry ambiguities by utilizing prior information from a pretrained text-to-3D model to initialize a 3D avatar with robust appearance, geometry, and rigging relationships to the morphable model. Second, we refine the initial 3D avatar for dynamic expressions using a ControlNet that is conditioned on semantic and normal maps of the morphable model to ensure accurate alignment. As a result, our method outperforms existing approaches in synthesis quality, alignment, and animation fidelity, advancing the state of the art in text-based, animatable 3D head avatar generation.
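For intuition about the rigging relationship to the morphable model, the sketch below shows one common way to bind 3D Gaussian centers to a deforming FLAME mesh: each Gaussian stores a triangle index, barycentric coordinates, and an offset along the face normal, so its center follows the mesh under new expression poses. The binding scheme and all names here are assumptions for illustration; the paper's actual rigging may differ.

```python
# Minimal sketch of rigging 3D Gaussian centers to a posed FLAME mesh.
# Assumes each Gaussian was bound at initialization to one triangle via
# (tri_idx, bary, offset). Hypothetical, not the paper's implementation.
import torch
import torch.nn.functional as F

def rigged_centers(verts, faces, tri_idx, bary, offset):
    """verts: (V, 3) posed FLAME vertices; faces: (F, 3) vertex indices;
    tri_idx: (N,) bound triangle per Gaussian; bary: (N, 3); offset: (N, 1).
    Returns (N, 3) Gaussian centers that move with the mesh."""
    tri = verts[faces[tri_idx]]                    # (N, 3, 3) triangle corners
    base = (bary.unsqueeze(-1) * tri).sum(dim=1)   # barycentric interpolation
    e1, e2 = tri[:, 1] - tri[:, 0], tri[:, 2] - tri[:, 0]
    n = F.normalize(torch.cross(e1, e2, dim=-1), dim=-1)  # face normals
    return base + offset * n                       # centers follow the mesh
```

Re-evaluating this function with vertices posed by new FLAME expression parameters deforms all bound Gaussians consistently, which is what allows the distilled appearance to stay aligned with the driving parametric model during animation.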