🤖 AI Summary
This work addresses the problem of reconstructing animatable 3D human avatars from a single image or a sparse set of images, without pose priors. We propose an end-to-end neural implicit framework that requires no pose input at test time, eliminating reliance on accurate pose estimation or camera parameters, and jointly models implicit geometry and view-dependent appearance using images alone as supervision during training and as input at inference. The key contribution is the decoupling of pose learning from deformation modeling, which prevents pose estimation noise from propagating into geometry reconstruction and significantly improves robustness in real-world scenarios. Extensive experiments on THuman2.0, XHuman, and HuGe100K demonstrate that our approach substantially outperforms existing methods under pose-free conditions, while achieving comparable performance when ground-truth poses are available, validating its generality and effectiveness.
📝 Abstract
We tackle the task of recovering an animatable 3D human avatar from a single image or a sparse set of images. For this task, beyond a set of images, many prior state-of-the-art methods use accurate "ground-truth" camera poses and human poses as input to guide reconstruction at test time. We show that pose-dependent reconstruction degrades significantly if pose estimates are noisy. To overcome this, we introduce NoPo-Avatar, which reconstructs avatars solely from images, without any pose input. By removing the dependence of test-time reconstruction on human poses, NoPo-Avatar is unaffected by noisy human pose estimates, making it more widely applicable. Experiments on the challenging THuman2.0, XHuman, and HuGe100K data show that NoPo-Avatar outperforms existing baselines in practical settings (without ground-truth poses) and delivers comparable results in lab settings (with ground-truth poses).