🤖 AI Summary
Existing methods for full-body 3D human reconstruction from a single RGB image suffer from geometric distortion, identity drift, and facial deformation when faced with complex clothing topology and self-occlusion. This paper proposes a cross-scale multi-view diffusion framework to address these challenges. First, it introduces a diffusion mechanism that jointly models global body shape and local facial details. Second, it conditions generation on an SMPL-X parametric prior to enforce anatomical plausibility. Third, it designs an explicit mesh carving strategy, initialized from SMPL-X, that recovers a geometrically consistent, textured surface. The framework synthesizes multi-view surface normals and color images, which in turn drive texture mapping. Evaluated on CAPE and THuman2.1, the method improves geometric detail, texture fidelity, and generalization across poses, while avoiding facial distortions and preserving subject identity in the reconstructed meshes.
📝 Abstract
Detailed and photorealistic 3D human modeling is essential for various applications and has seen tremendous progress. However, full-body reconstruction from a monocular RGB image remains challenging due to the ill-posed nature of the problem and sophisticated clothing topology with self-occlusions. In this paper, we propose PSHuman, a novel framework that explicitly reconstructs human meshes using priors from a multi-view diffusion model. We find that directly applying multi-view diffusion to single-view human images leads to severe geometric distortions, especially on generated faces. To address this, we propose a cross-scale diffusion that models the joint probability distribution of global full-body shape and local facial characteristics, enabling detailed and identity-preserving novel-view generation without geometric distortion. Moreover, to enhance cross-view body shape consistency across varied human poses, we condition the generative model on parametric models such as SMPL-X, which provide body priors and prevent unnatural views inconsistent with human anatomy. Leveraging the generated multi-view normal and color images, we present SMPL-X-initialized explicit human carving to recover realistic textured human meshes efficiently. Extensive experimental results and quantitative evaluations on the CAPE and THuman2.1 datasets demonstrate PSHuman's superiority in geometric detail, texture fidelity, and generalization capability.
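One way to read the cross-scale diffusion described above is as a change in what the generative model samples. The notation below is an assumption introduced here for illustration and is not taken from the paper: $x^{1:N}$ denotes the $N$ generated full-body views (normal and color), $x_f$ a generated local face view, $y$ the input image, $y_f$ a face crop of the input, and $s$ the SMPL-X conditioning.

```latex
% Illustrative notation only; symbols are assumptions, not the paper's.
\begin{align*}
  % Plain multi-view diffusion: full-body views only, conditioned on the input image
  \text{body only:} \quad & p_\theta\!\left(x^{1:N} \mid y\right) \\
  % Cross-scale reading: body views and a local face view are sampled jointly,
  % with additional conditioning on a face crop and on SMPL-X
  \text{cross-scale:} \quad & p_\theta\!\left(x^{1:N},\, x_f \mid y,\, y_f,\, s\right)
\end{align*}
```

Under this reading, the face view is sampled together with the body views rather than refined afterwards, which is consistent with the abstract's claim that global shape and local facial characteristics are modeled as a joint distribution.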