🤖 AI Summary
Existing methods for generating photorealistic human images with controlled 3D poses and camera viewpoints often rely on laborious manual rigging or per-pose re-optimization, and struggle to balance efficiency with preservation of fine detail. This work proposes PoseCraft, a diffusion-based framework that encodes sparse 3D human landmarks and camera extrinsics as discrete conditioning tokens and injects them into the diffusion process via cross-attention, rather than relying only on rasterized 2D control images. This avoids the ambiguity introduced by 2D reprojection and preserves 3D semantics under large pose and viewpoint changes. Trained and evaluated with GenHumanRF, a data generation workflow that renders supervision from volumetric reconstructions, the method shows a significant improvement in perceptual quality over diffusion-centric methods, and achieves better or comparable metrics relative to recent state-of-the-art volumetric rendering methods while better preserving fine details such as fabric and hair.
📝 Abstract
Digitizing humans and synthesizing photorealistic avatars with explicit 3D pose and camera controls are central to VR, telepresence, and entertainment. Existing skinning-based workflows require laborious manual rigging or template-based fitting, while neural volumetric methods rely on canonical templates and re-optimization for each unseen pose. We present PoseCraft, a diffusion framework built around a tokenized 3D interface: instead of relying only on rasterized geometry as 2D control images, we encode sparse 3D landmarks and camera extrinsics as discrete conditioning tokens and inject them into diffusion via cross-attention. Our approach preserves 3D semantics by avoiding 2D re-projection ambiguity under large pose and viewpoint changes, and produces photorealistic imagery that faithfully captures identity and appearance. To train and evaluate at scale, we also implement GenHumanRF, a data generation workflow that renders diverse supervision from volumetric reconstructions. Our experiments show that PoseCraft achieves a significant improvement in perceptual quality over diffusion-centric methods, and attains better or comparable metrics relative to the latest state-of-the-art volumetric rendering methods while better preserving fabric and hair details.
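The tokenized 3D interface described above can be illustrated with a minimal NumPy sketch. It is not the PoseCraft implementation: the function names, the 17-landmark skeleton, the token width, the random projection weights, and the choice of one token per landmark plus one camera token are all assumptions made for illustration. The sketch first shows why a 2D projection is ambiguous (two points on the same camera ray land on the same pixel), then builds conditioning tokens from 3D landmarks and a 3x4 extrinsic matrix and injects them into a set of latent features with single-head cross-attention.

```python
import numpy as np

rng = np.random.default_rng(0)

def project(points_3d, focal=1.0):
    """Pinhole projection of camera-space points (N, 3) to 2D (N, 2)."""
    return focal * points_3d[:, :2] / points_3d[:, 2:3]

def make_condition_tokens(landmarks, extrinsics, w_kp, w_cam):
    """Hypothetical tokenizer: one token per 3D landmark plus one camera token."""
    kp_tokens = landmarks @ w_kp                    # (J, D)
    cam_token = extrinsics.reshape(1, -1) @ w_cam   # (1, D)
    return np.concatenate([kp_tokens, cam_token], axis=0)  # (J + 1, D)

def cross_attention(latents, tokens, w_q, w_k, w_v):
    """Single-head cross-attention: latents are queries, tokens are keys/values."""
    q = latents @ w_q                               # (N, D)
    k = tokens @ w_k                                # (T, D)
    v = tokens @ w_v                                # (T, C)
    scores = q @ k.T / np.sqrt(q.shape[-1])         # (N, T)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return latents + weights @ v, weights           # residual update

# Assumed sizes: J landmarks, token width D, N latent positions, latent width C.
J, D, N, C = 17, 32, 64, 48
w_kp = rng.normal(size=(3, D))
w_cam = rng.normal(size=(12, D))
w_q = rng.normal(size=(C, D)) / np.sqrt(C)
w_k = rng.normal(size=(D, D)) / np.sqrt(D)
w_v = rng.normal(size=(D, C)) / np.sqrt(D)

# 2D re-projection ambiguity: two points on the same ray share one pixel.
near = np.array([[0.2, 0.1, 1.0]])
far = 3.0 * near
same_pixel = np.allclose(project(near), project(far))
same_token = np.allclose(near @ w_kp, far @ w_kp)

# Conditioning: landmarks in front of the camera, identity rotation, zero translation.
landmarks = rng.normal(size=(J, 3)) + np.array([0.0, 0.0, 3.0])
extrinsics = np.hstack([np.eye(3), np.zeros((3, 1))])   # [R | t], shape (3, 4)
tokens = make_condition_tokens(landmarks, extrinsics, w_kp, w_cam)

latents = rng.normal(size=(N, C))
conditioned, attn = cross_attention(latents, tokens, w_q, w_k, w_v)
```

In this toy setup the two points on the same ray are indistinguishable after projection but still map to different 3D tokens, which is the property the abstract attributes to conditioning on 3D landmarks directly. A real diffusion model would use learned, multi-head attention inside its denoising network; the linear maps here only stand in for those learned layers.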