🤖 AI Summary
This work addresses the challenging problem of reconstructing animatable, identity-consistent, full-body 3D human avatars from a single input image. To tackle identity drift in diffusion-based video generation and to ensure geometric fidelity, the authors propose PERSONA, a method that integrates diffusion priors with geometry-aware 3D optimization. Specifically, they introduce a balanced sampling strategy that mitigates identity inconsistency across poses, and a geometry-weighted optimization scheme that prioritizes geometry constraints over photometric loss. The framework first uses a diffusion model to synthesize a pose-diverse synthetic training video from the single input image; this video then drives 3D reconstruction with either Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS). Experiments demonstrate high-fidelity, cloth-aware dynamic geometry, photorealistic appearance, and strong cross-pose identity consistency, significantly outperforming existing single-image 3D human reconstruction methods.
📝 Abstract
Two major approaches exist for creating animatable human avatars. The first, a 3D-based approach, optimizes a NeRF- or 3DGS-based avatar from videos of a single person, achieving personalization through a disentangled identity representation. However, modeling pose-driven deformations, such as non-rigid cloth deformations, requires numerous pose-rich videos, which are costly and impractical to capture in daily life. The second, a diffusion-based approach, learns pose-driven deformations from large-scale in-the-wild videos but struggles with identity preservation and pose-dependent identity entanglement. We present PERSONA, a framework that combines the strengths of both approaches to obtain a personalized 3D human avatar with pose-driven deformations from a single image. PERSONA leverages a diffusion-based approach to generate pose-rich videos from the input image and optimizes a 3D avatar based on them. To ensure high authenticity and sharp renderings across diverse poses, we introduce balanced sampling and geometry-weighted optimization. Balanced sampling oversamples the input image to mitigate identity shifts in diffusion-generated training videos. Geometry-weighted optimization prioritizes geometry constraints over image loss, preserving rendering quality in diverse poses.
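The two strategies described above can be summarized in a minimal sketch. This is an illustrative reconstruction, not the authors' implementation: all function names, the `input_ratio` probability, and the `w_geo` weight are hypothetical, chosen only to show the two ideas — oversampling the real input image relative to diffusion-generated frames, and weighting geometry terms above the photometric loss.

```python
import random

def balanced_sample(input_image, generated_frames, input_ratio=0.5):
    """Balanced sampling (sketch): draw the real input image with
    probability `input_ratio` so it is oversampled relative to the
    diffusion-generated frames, anchoring identity during training.
    `input_ratio` is an assumed hyperparameter, not from the paper."""
    if random.random() < input_ratio:
        return input_image                    # ground-truth identity anchor
    return random.choice(generated_frames)    # pose-rich synthetic frame

def geometry_weighted_loss(l_image, l_geometry, w_geo=10.0):
    """Geometry-weighted optimization (sketch): prioritize geometry
    constraints over the image loss. `w_geo` is an assumed weight."""
    return l_image + w_geo * l_geometry
```

In a training loop, `balanced_sample` would pick the supervision frame for each iteration, and `geometry_weighted_loss` would combine the per-frame photometric term with geometry terms (e.g. depth or normal constraints) before backpropagation.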