🤖 AI Summary
Reconstructing complete, animatable 3D human avatars from monocular video remains challenging under severe occlusions: missing observations lead existing methods to produce geometric artifacts and temporal inconsistencies. This work proposes an approach that combines a multi-scale UV parameterization for robust geometry reconstruction with an identity-preserving diffusion inpainting module, which uses textual inversion and semantic guidance to recover occluded regions while preserving subject-specific detail and temporal coherence. Through hierarchical coarse-to-fine feature interpolation and direct pixel-level supervision, the method improves reconstruction quality on the PeopleSnapshot, ZJU-MoCap, and OcMotion datasets, yielding more complete geometry and temporally stable animations.
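The "multi-scale UV parameterization with hierarchical feature interpolation" can be pictured as sampling a pyramid of UV-space feature grids at each surface point and fusing the results coarse-to-fine. The sketch below is a minimal NumPy illustration under assumed choices (bilinear sampling, concatenation as the fusion, a hypothetical 3-level pyramid); the paper's actual resolutions, channel counts, and fusion scheme may differ.

```python
import numpy as np

def bilerp(grid, u, v):
    """Bilinearly sample an (H, W, C) feature grid at continuous UV coords in [0, 1]."""
    H, W, _ = grid.shape
    x, y = u * (W - 1), v * (H - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    fx, fy = x - x0, y - y0
    top = grid[y0, x0] * (1 - fx) + grid[y0, x1] * fx
    bot = grid[y1, x0] * (1 - fx) + grid[y1, x1] * fx
    return top * (1 - fy) + bot * fy

def multiscale_uv_feature(pyramid, u, v):
    """Concatenate features sampled from each level of a coarse-to-fine UV pyramid.

    Coarse levels give robust, smooth features for occluded regions; fine
    levels add geometric detail where observations exist.
    """
    return np.concatenate([bilerp(g, u, v) for g in pyramid])

# Hypothetical pyramid: 8x8, 32x32, 128x128 UV feature maps, 4 channels each.
rng = np.random.default_rng(0)
pyramid = [rng.normal(size=(r, r, 4)).astype(np.float32) for r in (8, 32, 128)]
feat = multiscale_uv_feature(pyramid, u=0.37, v=0.81)
print(feat.shape)  # (12,): 3 levels x 4 channels
```

In practice such per-point features would be decoded (e.g. by a small MLP) into Gaussian or geometry parameters; that decoder is omitted here.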
📝 Abstract
Reconstructing complete and animatable 3D human avatars from monocular videos remains challenging, particularly under severe occlusions. While 3D Gaussian Splatting has enabled photorealistic human rendering, existing methods struggle with incomplete observations, often producing corrupted geometry and temporal inconsistencies. We present InpaintHuman, a novel method for generating high-fidelity, complete, and animatable avatars from occluded monocular videos. Our approach introduces two key innovations: (i) a multi-scale UV-parameterized representation with hierarchical coarse-to-fine feature interpolation, enabling robust reconstruction of occluded regions while preserving geometric details; and (ii) an identity-preserving diffusion inpainting module that integrates textual inversion with semantic-conditioned guidance for subject-specific, temporally coherent completion. Unlike SDS-based methods, our approach employs direct pixel-level supervision to ensure identity fidelity. Experiments on benchmarks with synthetic occlusions (PeopleSnapshot, ZJU-MoCap) and real-world occlusions (OcMotion) demonstrate competitive performance with consistent improvements in reconstruction quality across diverse poses and viewpoints.
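The contrast with SDS is that, instead of backpropagating a diffusion score through the renderer, the inpainted image is treated as a direct per-pixel target. A minimal sketch of such a masked photometric loss, under assumed conventions (L1 error, a binary mask marking inpainted pixels; the function name and shapes are illustrative, not the paper's API):

```python
import numpy as np

def masked_pixel_loss(rendered, inpainted_target, mask):
    """Mean L1 photometric error over pixels flagged by the occlusion mask.

    rendered, inpainted_target: (H, W, 3) float images in [0, 1];
    mask: (H, W) in {0, 1}, where 1 marks pixels the diffusion model completed.
    """
    per_pixel = np.abs(rendered - inpainted_target).sum(axis=-1)  # L1 over RGB
    denom = max(mask.sum(), 1)  # avoid division by zero if nothing is masked
    return float((per_pixel * mask).sum() / denom)

# Toy example: supervise only the occluded (top) half of a 2x2 image.
rendered = np.zeros((2, 2, 3))
target = np.ones((2, 2, 3))
mask = np.array([[1.0, 1.0], [0.0, 0.0]])
print(masked_pixel_loss(rendered, target, mask))  # 3.0: |0 - 1| over 3 channels
```

Because the target is a fixed image rather than a noisy score estimate, gradients are low-variance and the inpainted appearance (and hence identity) is matched directly.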