SmartAvatar: Text- and Image-Guided Human Avatar Generation with VLM AI Agents

πŸ“… 2025-06-05
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
Existing diffusion-based methods for 3D portrait generation struggle to simultaneously ensure identity fidelity, controllable body morphology, and animation readiness. SmartAvatar addresses this by introducing an autonomous, iterative framework powered by a vision-language model (VLM) agent, capable of generating fully rigged, animatable 3D human avatars from a single image or text prompt. Its core innovation is a closed-loop verification mechanism: a VLM agent jointly evaluates facial similarity, anatomical plausibility, and prompt alignment, then issues natural-language instructions to guide parameter refinement. The method integrates parametric human models (e.g., SMPL-X), neural rendering, and large language model–driven procedural generation. Experiments demonstrate that SmartAvatar outperforms state-of-the-art methods across mesh quality, identity preservation, attribute accuracy, and animation usability. Moreover, it enables real-time customization on consumer-grade hardware.

πŸ“ Abstract
SmartAvatar is a vision-language-agent-driven framework for generating fully rigged, animation-ready 3D human avatars from a single photo or textual prompt. While diffusion-based methods have made progress in general 3D object generation, they continue to struggle with precise control over human identity, body shape, and animation readiness. In contrast, SmartAvatar leverages the commonsense reasoning capabilities of large vision-language models (VLMs) in combination with off-the-shelf parametric human generators to deliver high-quality, customizable avatars. A key innovation is an autonomous verification loop, where the agent renders draft avatars, evaluates facial similarity, anatomical plausibility, and prompt alignment, and iteratively adjusts generation parameters for convergence. This interactive, AI-guided refinement process promotes fine-grained control over both facial and body features, enabling users to iteratively refine their avatars via natural-language conversations. Unlike diffusion models that rely on static pre-trained datasets and offer limited flexibility, SmartAvatar brings users into the modeling loop and ensures continuous improvement through an LLM-driven procedural generation and verification system. The generated avatars are fully rigged and support pose manipulation with consistent identity and appearance, making them suitable for downstream animation and interactive applications. Quantitative benchmarks and user studies demonstrate that SmartAvatar outperforms recent text- and image-driven avatar generation systems in terms of reconstructed mesh quality, identity fidelity, attribute accuracy, and animation readiness, making it a versatile tool for realistic, customizable avatar creation on consumer-grade hardware.
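The autonomous verification loop described above (render a draft, have the VLM agent evaluate it, adjust generation parameters, repeat until convergence) can be sketched in outline. Everything below is a hypothetical illustration, not the authors' actual API: `AvatarParams`, `render`, `critique`, and `refine` stand in for the paper's SMPL-X parameters, neural renderer, and VLM agent, and the numeric feedback dictionary replaces the agent's natural-language instructions.

```python
# Minimal sketch of a closed-loop draft/evaluate/refine cycle.
# All names here are hypothetical stand-ins; the real system uses a
# VLM agent, SMPL-X parameters, and neural rendering.
from dataclasses import dataclass, field


@dataclass
class AvatarParams:
    """Stand-in for parametric human model controls (e.g., shape/face)."""
    values: dict = field(default_factory=lambda: {"height": 0.5, "face_width": 0.5})


def render(params: AvatarParams) -> dict:
    # Placeholder for rendering a draft avatar from current parameters.
    return dict(params.values)


def critique(draft: dict, target: dict):
    """Stand-in for the VLM agent: score the draft against the prompt
    and return per-parameter adjustment suggestions."""
    feedback = {k: target[k] - draft[k] for k in target}
    score = 1.0 - sum(abs(d) for d in feedback.values())
    return score, feedback


def refine(params: AvatarParams, feedback: dict, step: float = 0.5) -> AvatarParams:
    # Apply the agent's suggestions as small parameter nudges.
    for k, delta in feedback.items():
        params.values[k] += step * delta
    return params


def generate(target: dict, threshold: float = 0.95, max_iters: int = 20):
    params = AvatarParams()
    for _ in range(max_iters):
        draft = render(params)
        score, feedback = critique(draft, target)
        if score >= threshold:  # agent judges the draft acceptable
            break
        params = refine(params, feedback)
    return params, score


params, score = generate({"height": 0.8, "face_width": 0.4})
print(round(score, 3))
```

The design point the paper emphasizes is that the critic is a VLM with commonsense reasoning, so "score" and "feedback" are produced from rendered images and natural language rather than from a differentiable loss, which is what lets the loop work with off-the-shelf parametric generators.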
Problem

Research questions and friction points this paper is trying to address.

Generates 3D human avatars from text or single photo
Ensures precise control over identity and body shape
Provides animation-ready avatars with customizable features
Innovation

Methods, ideas, or system contributions that make the work stand out.

VLM AI agents guide avatar generation process
Autonomous verification loop ensures quality refinement
LLM-driven procedural generation enables customization
Alexander Huang-Menders, Dartmouth College
Xinhang Liu, HKUST (Computer Vision)
Andy Xu, Dartmouth College
Yuyao Zhang, Renmin University of China (Artificial Intelligence)
Chi-Keung Tang, The Hong Kong University of Science and Technology
Yu-Wing Tai, Dartmouth College (Computer Vision, Deep Learning, Multi-modalities, Generative AI)