🤖 AI Summary
This paper addresses the challenge of reconstructing high-fidelity, identity-consistent 3D avatars from a few unconstrained mobile phone photographs, a setting in which existing methods suffer from geometric inconsistency, identity degradation, and loss of high-frequency details (e.g., wrinkles, fine hair). It proposes a "Capture, Canonicalize, Splat" pipeline: (1) a generative canonicalization module maps multiple unstructured phone views to a standardized, consistent representation without explicit pose annotations; (2) a Transformer-based 3D Gaussian splatting network is trained end-to-end on a new large-scale dataset of Gaussian splatting avatars derived from dome captures of real people. The method requires no multi-view registration or pose supervision, enabling zero-shot generation of static quarter-body 3D Gaussian avatars. It improves geometric consistency, identity fidelity, and the realism of high-frequency surface details such as skin texture and hair, and it keeps identity stable under uncontrolled capture conditions.
📝 Abstract
We present a novel, zero-shot pipeline for creating hyperrealistic, identity-preserving 3D avatars from a few unstructured phone images. Existing methods face several challenges: single-view approaches suffer from geometric inconsistencies and hallucinations, degrading identity preservation, while models trained on synthetic data fail to capture high-frequency details like skin wrinkles and fine hair, limiting realism. Our method introduces two key contributions: (1) a generative canonicalization module that processes multiple unstructured views into a standardized, consistent representation, and (2) a transformer-based model trained on a new, large-scale dataset of high-fidelity Gaussian splatting avatars derived from dome captures of real people. This "Capture, Canonicalize, Splat" pipeline produces static quarter-body avatars with compelling realism and robust identity preservation from unstructured photos.
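The two-stage data flow described in the abstract can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: the names `canonicalize`, `predict_gaussians`, and `build_avatar`, the number of canonical views, and the Gaussians-per-view count are all hypothetical, and both stages are stand-ins that return placeholder values where the paper uses a generative module and a trained transformer. The `Gaussian` fields follow the attributes commonly used in 3D Gaussian splatting (center, scale, rotation, opacity, color).

```python
import math
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Gaussian:
    """One 3D Gaussian primitive with attributes commonly used in splatting."""
    position: Tuple[float, ...]  # center (x, y, z)
    scale: Tuple[float, ...]     # per-axis extent, positive
    rotation: Tuple[float, ...]  # unit quaternion (w, x, y, z)
    opacity: float               # in [0, 1]
    color: Tuple[float, ...]     # RGB, each channel in [0, 1]


def canonicalize(photos: Sequence[str], num_views: int = 4) -> List[str]:
    """Stage 1 stand-in: map unstructured photos to a fixed set of views.

    A real module would be generative; this stub only returns labels so the
    data flow can be followed. num_views is an assumed value.
    """
    if not photos:
        raise ValueError("at least one input photo is required")
    return [f"canonical_view_{i}" for i in range(num_views)]


def predict_gaussians(views: Sequence[str], per_view: int = 2,
                      seed: int = 0) -> List[Gaussian]:
    """Stage 2 stand-in: emit Gaussians for the canonical views.

    A real model would be a trained transformer; the values here are random
    placeholders with valid ranges.
    """
    rng = random.Random(seed)
    gaussians = []
    for _ in range(len(views) * per_view):
        q = [rng.gauss(0.0, 1.0) for _ in range(4)]
        norm = math.sqrt(sum(c * c for c in q))
        gaussians.append(Gaussian(
            position=tuple(rng.uniform(-1.0, 1.0) for _ in range(3)),
            scale=tuple(rng.uniform(0.01, 0.1) for _ in range(3)),
            rotation=tuple(c / norm for c in q),  # normalized quaternion
            opacity=rng.random(),
            color=tuple(rng.random() for _ in range(3)),
        ))
    return gaussians


def build_avatar(photos: Sequence[str]) -> List[Gaussian]:
    """Capture -> canonicalize -> splat, chained in the order the abstract gives."""
    return predict_gaussians(canonicalize(photos))
```

Calling `build_avatar(["front.jpg", "side.jpg"])` returns a flat list of `Gaussian` records that a splatting renderer could consume. The point of the sketch is the separation of concerns: the second stage only ever sees canonical views, never the raw phone photos.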