🤖 AI Summary
This paper addresses the challenging problem of generating animatable 3D human models from a single input image. Methodologically, it introduces the first single-view human diffusion framework integrated with generative priors. Specifically: (1) it pioneers the transfer of pre-trained diffusion priors to human modeling, significantly improving geometric and textural fidelity; (2) it incorporates a Human NeRF module that jointly encodes camera pose and human pose transformations in an implicit manner, ensuring view- and pose-consistency; and (3) it proposes an image-level cross-space loss to bridge the diffusion latent space and pixel space. Evaluated on RenderPeople and DNA-Rendering benchmarks, the method achieves state-of-the-art performance in novel-view synthesis and novel-pose retargeting—demonstrating superior perceptual quality, generalization capability, and fine-grained detail consistency. These results validate the effectiveness of synergistically combining generative priors with neural radiance fields for monocular 3D human reconstruction.
📝 Abstract
While previous single-view-based 3D human reconstruction methods have made significant progress in novel view synthesis, synthesizing both view-consistent and pose-consistent results for animatable human avatars from a single input image remains a challenge. Motivated by the success of 2D character animation, we propose HumanGif, a single-view human diffusion model with generative priors. Specifically, we formulate single-view-based 3D human novel view and pose synthesis as a single-view-conditioned human diffusion process, utilizing generative priors from foundational diffusion models. To ensure fine-grained and consistent novel view and pose synthesis, we introduce a Human NeRF module in HumanGif to learn spatially aligned features from the input image, implicitly capturing the relative camera and human pose transformations. Furthermore, we introduce an image-level loss during optimization to bridge the gap between the latent and image spaces of diffusion models. Extensive experiments on the RenderPeople and DNA-Rendering datasets demonstrate that HumanGif achieves the best perceptual performance, with better generalizability for novel view and pose synthesis.
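The image-level cross-space loss described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name, the `decode` callable (standing in for a frozen VAE decoder), and the weighting term are all assumptions. The idea is simply to supplement the usual latent-space reconstruction term with a pixel-space term computed on the decoded latent.

```python
import numpy as np

def cross_space_loss(pred_latent, target_latent, target_image,
                     decode, pixel_weight=0.1):
    """Hypothetical image-level cross-space loss: a latent-space
    reconstruction term plus a pixel-space term on the decoded latent.
    `decode` stands in for a (frozen) VAE decoder mapping latents to images."""
    # Standard latent-space reconstruction objective (MSE)
    latent_loss = np.mean((pred_latent - target_latent) ** 2)
    # Decode the predicted latent and supervise directly in image space (L1)
    pred_image = decode(pred_latent)
    pixel_loss = np.mean(np.abs(pred_image - target_image))
    return latent_loss + pixel_weight * pixel_loss

# Toy usage: an identity "decoder" keeps the example self-contained
pred = np.zeros((4, 8, 8))
target = np.ones((4, 8, 8))
loss = cross_space_loss(pred, target, target, decode=lambda z: z)
```

In practice the decoder would be the diffusion model's VAE decoder, so the pixel-space term back-propagates image-level detail into the latent denoising process.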