🤖 AI Summary
This work addresses two key bottlenecks in monocular video-driven 3D Gaussian Splatting (3DGS) for human avatars: time-consuming per-subject optimization and poor generalization under sparse input views. We propose the Parametric Gaussian Human Model (PGHM), the first general-purpose 3DGS framework for high-fidelity, animatable human reconstruction. Our method introduces: (1) a UV-aligned implicit identity map for spatially consistent identity encoding; (2) a decoupled multi-head U-Net that explicitly models and jointly optimizes geometry, appearance, pose, and viewpoint attributes; and (3) the integration of parametric human priors with UV-space feature encoding to enhance structural robustness under sparse-view conditions. The framework reconstructs a single subject in approximately 20 minutes, substantially faster than optimization from scratch, while achieving comparable visual fidelity, and it maintains stable rendering quality even under large pose variations and extreme camera viewpoints.
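Below is a minimal, hypothetical PyTorch sketch of this design, not the authors' implementation: a learnable UV-aligned identity map feeds a shared convolutional trunk (standing in for the Multi-Head U-Net), and decoupled heads predict static Gaussian attributes plus pose-dependent and view-dependent corrections. All class and parameter names, channel sizes, and the SMPL-style pose/view conditioning are assumptions made for illustration.

```python
# Hypothetical sketch (not the released PGHM code): UV-aligned identity map plus a
# shared trunk with decoupled heads for static, pose-dependent, and view-dependent
# Gaussian attributes. Names and dimensions are assumptions for illustration.
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1), nn.GroupNorm(8, c_out), nn.SiLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1), nn.GroupNorm(8, c_out), nn.SiLU(),
        )

    def forward(self, x):
        return self.net(x)


class PGHMSketch(nn.Module):
    """UV-space identity map -> shared trunk -> per-attribute heads."""

    def __init__(self, uv_res=128, id_ch=32, pose_dim=69, view_dim=3, trunk_ch=64):
        super().__init__()
        # Learnable UV-aligned identity map: one feature vector per UV texel.
        self.identity_map = nn.Parameter(torch.randn(1, id_ch, uv_res, uv_res) * 0.01)
        # Pose and view conditioning are broadcast over the UV grid as extra channels.
        self.pose_embed = nn.Linear(pose_dim, 16)
        self.view_embed = nn.Linear(view_dim, 16)
        self.trunk = ConvBlock(id_ch, trunk_ch)
        # Decoupled heads: static geometry/appearance, pose-dependent offsets,
        # view-dependent color. Channel counts follow a common 3DGS attribute layout
        # (xyz offset 3, rotation 4, scale 3, opacity 1, base color 3).
        self.static_head = nn.Conv2d(trunk_ch, 3 + 4 + 3 + 1 + 3, 1)
        self.pose_head = nn.Conv2d(trunk_ch + 16, 3, 1)   # pose-dependent xyz delta
        self.view_head = nn.Conv2d(trunk_ch + 16, 3, 1)   # view-dependent color delta

    def forward(self, pose, view_dir):
        b = pose.shape[0]
        feat = self.trunk(self.identity_map.expand(b, -1, -1, -1))
        h, w = feat.shape[-2:]
        p = self.pose_embed(pose)[:, :, None, None].expand(-1, -1, h, w)
        v = self.view_embed(view_dir)[:, :, None, None].expand(-1, -1, h, w)
        static = self.static_head(feat)                   # (B, 14, H, W)
        d_xyz = self.pose_head(torch.cat([feat, p], 1))   # pose-dependent correction
        d_rgb = self.view_head(torch.cat([feat, v], 1))   # view-dependent correction
        return static, d_xyz, d_rgb


if __name__ == "__main__":
    model = PGHMSketch()
    pose = torch.zeros(2, 69)   # e.g. SMPL-style body pose parameters (assumed)
    view = torch.randn(2, 3)    # camera viewing direction
    static, d_xyz, d_rgb = model(pose, view)
    print(static.shape, d_xyz.shape, d_rgb.shape)
```

In the full method, the predicted UV-space attribute maps would presumably be lifted to 3D Gaussians anchored to the parametric body prior and rendered by splatting; that step is omitted here.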
📝 Abstract
Photorealistic and animatable human avatars are a key enabler for virtual/augmented reality, telepresence, and digital entertainment. While recent advances in 3D Gaussian Splatting (3DGS) have greatly improved rendering quality and efficiency, existing methods still face fundamental challenges, including time-consuming per-subject optimization and poor generalization under sparse monocular inputs. In this work, we present the Parametric Gaussian Human Model (PGHM), a generalizable and efficient framework that integrates human priors into 3DGS for fast and high-fidelity avatar reconstruction from monocular videos. PGHM introduces two core components: (1) a UV-aligned latent identity map that compactly encodes subject-specific geometry and appearance into a learnable feature tensor; and (2) a disentangled Multi-Head U-Net that predicts Gaussian attributes by decomposing static, pose-dependent, and view-dependent components via conditioned decoders. This design enables robust rendering quality under challenging poses and viewpoints, while allowing efficient subject adaptation without requiring multi-view capture or long optimization time. Experiments show that PGHM is significantly more efficient than optimization-from-scratch methods, requiring only approximately 20 minutes per subject to produce avatars with comparable visual quality, thereby demonstrating its practical applicability for real-world monocular avatar creation.
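As a rough illustration of the fast per-subject adaptation claimed above, the hypothetical sketch below freezes a pretrained shared decoder and optimizes only the subject's UV-aligned identity map against monocular frames with a photometric loss. `rasterize_gaussians` is a dummy placeholder for a differentiable 3DGS renderer, and all names, losses, and hyperparameters are assumptions rather than the paper's actual procedure.

```python
# Hypothetical per-subject adaptation loop, assuming a pretrained shared decoder
# (e.g. the PGHMSketch above). Only the identity map is optimized for this subject.
import torch
import torch.nn.functional as F


def rasterize_gaussians(attributes, camera):
    # Placeholder: a real implementation would splat the predicted Gaussians with a
    # differentiable rasterizer. Here we return a dummy image so the loop runs.
    return attributes.mean(dim=1, keepdim=True).expand(-1, 3, -1, -1)


def adapt_to_subject(model, frames, cameras, poses, views, steps=2000, lr=1e-2):
    """Fit the UV-aligned identity map of `model` to one monocular video."""
    # Freeze the shared trunk and heads; adapt only the subject-specific identity map.
    for p in model.parameters():
        p.requires_grad_(False)
    model.identity_map.requires_grad_(True)
    opt = torch.optim.Adam([model.identity_map], lr=lr)

    for step in range(steps):
        i = step % len(frames)
        static, d_xyz, d_rgb = model(poses[i:i + 1], views[i:i + 1])
        attributes = torch.cat([static, d_xyz, d_rgb], dim=1)
        pred = rasterize_gaussians(attributes, cameras[i])
        loss = F.l1_loss(pred, frames[i:i + 1])   # photometric loss against the frame
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```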