🤖 AI Summary
To address the challenges of modeling subtle facial expressions and skin detail while maintaining real-time rendering efficiency in high-fidelity 3D head avatars, this paper proposes ScaffoldAvatar, a hierarchical framework that integrates patch-wise expression modeling with 3D Gaussian Splatting. Its core innovation is the construction of a patch-based expression latent space, replacing conventional global representations to enable precise, localized control of fine facial motion. ScaffoldAvatar further incorporates the Scaffold-GS scene representation, patch-level geometric modeling, color-based densification, and a progressive training strategy, enabling real-time (≥30 FPS) rendering at high resolution (3K). Experiments demonstrate significant improvements over state-of-the-art methods in reconstruction fidelity, motion naturalness, and training convergence speed. The method shows strong practical value for immersive telepresence and film production.
📝 Abstract
Generating high-fidelity, real-time animated sequences of photorealistic 3D head avatars is important for many graphics applications, including immersive telepresence and movies. The problem is particularly challenging when rendering digital avatar close-ups that show a character's facial micro-features and expressions. To capture the expressive, detailed nature of human heads, including skin furrowing and finer-scale facial movements, we propose to couple locally defined facial expressions with 3D Gaussian splatting, enabling the creation of ultra-high-fidelity, expressive, and photorealistic 3D head avatars. In contrast to previous works that operate on a global expression space, we condition our avatar's dynamics on patch-based local expression features and synthesize 3D Gaussians at a patch level. In particular, we leverage a patch-based geometric 3D face model to extract patch expressions and learn how to translate these into local dynamic skin appearance and motion by coupling the patches with the anchor points of Scaffold-GS, a recent hierarchical scene representation. These anchors are then used to synthesize 3D Gaussians on the fly, conditioned on patch expressions and viewing direction. We employ color-based densification and progressive training to obtain high-quality results and faster convergence on high-resolution 3K training images. By leveraging patch-level expressions, ScaffoldAvatar consistently achieves state-of-the-art performance with visually natural motion, while encompassing diverse facial expressions and styles in real time.
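To make the anchor-based synthesis concrete, the following is a minimal illustrative sketch (not the paper's actual code or API; all names, dimensions, and the single linear decoder are assumptions). It shows the general Scaffold-GS-style idea the abstract describes: each anchor, coupled to a face-model patch, spawns K Gaussians whose attributes are decoded from the anchor's feature vector concatenated with its local patch-expression code and the viewing direction.

```python
import numpy as np

# Hypothetical sketch of per-anchor Gaussian decoding. A single random linear
# layer stands in for the learned MLP decoders of the real method.
rng = np.random.default_rng(0)

NUM_ANCHORS = 4      # anchors coupled to face-model patches (toy size)
K = 3                # Gaussians spawned per anchor
F, E = 8, 6          # anchor-feature / patch-expression dimensions

def decode_gaussians(anchor_feat, patch_expr, view_dir, W):
    """Decode K Gaussians (position offset, scale, opacity) for one anchor,
    conditioned on its patch expression and the viewing direction."""
    x = np.concatenate([anchor_feat, patch_expr, view_dir])  # condition vector
    out = (W @ x).reshape(K, 7)                              # raw parameters
    offsets = out[:, :3]                        # offsets from the anchor point
    scales = np.exp(out[:, 3:6])                # exp keeps scales positive
    opacity = 1.0 / (1.0 + np.exp(-out[:, 6]))  # sigmoid maps to (0, 1)
    return offsets, scales, opacity

# Random stand-ins for learned quantities and inputs.
W = rng.normal(scale=0.1, size=(K * 7, F + E + 3))
anchor_feats = rng.normal(size=(NUM_ANCHORS, F))
patch_exprs = rng.normal(size=(NUM_ANCHORS, E))   # one local code per patch
view_dir = np.array([0.0, 0.0, 1.0])

gaussians = [
    decode_gaussians(anchor_feats[a], patch_exprs[a], view_dir, W)
    for a in range(NUM_ANCHORS)
]
```

Because each anchor sees only its own patch-expression code, a change in one facial region re-decodes only the Gaussians attached to that region's anchors, which is what enables the localized control the abstract contrasts with global expression spaces.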