🤖 AI Summary
This paper addresses the challenging problem of generating animatable 3D human models from a single input image. Methodologically, it introduces the first single-view human diffusion framework integrated with generative priors. Specifically: (1) it pioneers the transfer of pre-trained diffusion priors to human modeling, significantly improving geometric and textural fidelity; (2) it incorporates a Human NeRF module that jointly encodes camera pose and human pose transformations in an implicit manner, ensuring view- and pose-consistency; and (3) it proposes an image-level cross-space loss to bridge the diffusion latent space and pixel space. Evaluated on RenderPeople and DNA-Rendering benchmarks, the method achieves state-of-the-art performance in novel-view synthesis and novel-pose retargeting—demonstrating superior perceptual quality, generalization capability, and fine-grained detail consistency. These results validate the effectiveness of synergistically combining generative priors with neural radiance fields for monocular 3D human reconstruction.
📝 Abstract
While previous single-view-based 3D human reconstruction methods have made significant progress in novel view synthesis, synthesizing both view-consistent and pose-consistent results for animatable human avatars from a single input image remains a challenge. Motivated by the success of 2D character animation, we propose HumanGif, a single-view human diffusion model with generative priors. Specifically, we formulate single-view-based 3D human novel view and pose synthesis as a single-view-conditioned human diffusion process, utilizing generative priors from foundational diffusion models. To ensure fine-grained and consistent novel view and pose synthesis, we introduce a Human NeRF module in HumanGif to learn spatially aligned features from the input image, implicitly capturing the relative camera and human pose transformations. Furthermore, we introduce an image-level loss during optimization to bridge the gap between the latent and image spaces of diffusion models. Extensive experiments on the RenderPeople and DNA-Rendering datasets demonstrate that HumanGif achieves the best perceptual performance, with better generalizability for novel view and pose synthesis.
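The image-level cross-space loss described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name, the `decode` callable (standing in for a frozen VAE decoder), and the weighting term are all assumptions. The idea is simply to supplement the usual latent-space reconstruction term with a pixel-space term computed on the decoded latent.

```python
import numpy as np

def cross_space_loss(pred_latent, target_latent, target_image,
                     decode, pixel_weight=0.1):
    """Hypothetical image-level cross-space loss: a latent-space
    reconstruction term plus a pixel-space term on the decoded latent.
    `decode` stands in for a (frozen) VAE decoder mapping latents to images."""
    # Standard latent-space reconstruction objective (MSE)
    latent_loss = np.mean((pred_latent - target_latent) ** 2)
    # Decode the predicted latent and supervise directly in image space (L1)
    pred_image = decode(pred_latent)
    pixel_loss = np.mean(np.abs(pred_image - target_image))
    return latent_loss + pixel_weight * pixel_loss

# Toy usage: an identity "decoder" keeps the example self-contained
pred = np.zeros((4, 8, 8))
target = np.ones((4, 8, 8))
loss = cross_space_loss(pred, target, target, decode=lambda z: z)
```

In practice the decoder would be the diffusion model's VAE decoder, so the pixel-space term back-propagates image-level detail into the latent denoising process.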