PoseCraft: Tokenized 3D Body Landmark and Camera Conditioning for Photorealistic Human Image Synthesis

📅 2026-02-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing methods for generating photorealistic human images with controlled 3D poses and camera viewpoints often rely on laborious manual rigging or per-pose re-optimization, and they struggle to balance efficiency with fine-grained detail preservation. This work proposes PoseCraft, a diffusion-based framework that encodes sparse 3D human landmarks and camera extrinsics as discrete conditioning tokens and injects them into the diffusion process via cross-attention. Because the conditioning does not rely only on rasterized 2D control images, it avoids 2D re-projection ambiguity and preserves 3D semantics under large pose and viewpoint changes. Trained and evaluated with the accompanying GenHumanRF data generation workflow, which renders supervision from volumetric reconstructions, the method significantly improves perceptual quality over diffusion-centric methods and attains better or comparable metrics to recent state-of-the-art volumetric rendering methods, while better preserving fine details such as fabric textures and hair strands.
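
The 2D re-projection ambiguity mentioned above can be illustrated with a small sketch. It assumes a simple pinhole camera with a hypothetical focal length and principal point, and two made-up joints (an elbow and a wrist); none of these values come from the paper. Two forearm poses that differ in 3D land on exactly the same pixels, so a 2D control image alone cannot tell them apart.

```python
import numpy as np

def project_pinhole(points_cam, focal=1000.0, center=(512.0, 512.0)):
    """Project (N, 3) camera-space points to (N, 2) pixel coordinates.

    focal and center are illustrative values, not taken from the paper.
    """
    points_cam = np.asarray(points_cam, dtype=np.float64)
    xy = points_cam[:, :2] / points_cam[:, 2:3]  # perspective divide by depth
    return focal * xy + np.asarray(center)

# Hypothetical joints in camera space (metres, z is depth).
elbow = np.array([0.20, 0.00, 3.00])
wrist_toward = np.array([0.30, 0.10, 2.50])  # forearm reaches toward the camera
wrist_away = wrist_toward * 1.4              # same viewing ray, but behind the elbow

pose_a = np.stack([elbow, wrist_toward])
pose_b = np.stack([elbow, wrist_away])

pixels_a = project_pinhole(pose_a)
pixels_b = project_pinhole(pose_b)

same_2d = bool(np.allclose(pixels_a, pixels_b))   # identical 2D skeletons
differ_3d = not np.allclose(pose_a, pose_b)       # different 3D poses
```

Scaling a point along its viewing ray leaves its projection unchanged, which is why the two poses coincide in the image. Conditioning on the 3D coordinates directly keeps the depth information that the projection discards.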

📝 Abstract

Digitizing humans and synthesizing photorealistic avatars with explicit 3D pose and camera controls are central to VR, telepresence, and entertainment. Existing skinning-based workflows require laborious manual rigging or template-based fitting, while neural volumetric methods rely on canonical templates and re-optimization for each unseen pose. We present PoseCraft, a diffusion framework built around a tokenized 3D interface: instead of relying only on rasterized geometry as 2D control images, we encode sparse 3D landmarks and camera extrinsics as discrete conditioning tokens and inject them into diffusion via cross-attention. Our approach preserves 3D semantics by avoiding 2D re-projection ambiguity under large pose and viewpoint changes, and produces photorealistic imagery that faithfully captures identity and appearance. To train and evaluate at scale, we also implement GenHumanRF, a data generation workflow that renders diverse supervision from volumetric reconstructions. Our experiments show that PoseCraft achieves a significant perceptual quality improvement over diffusion-centric methods, and attains better or comparable metrics to the latest state-of-the-art volumetric rendering methods while better preserving fabric and hair details.
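
The conditioning path described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the abstract does not specify how tokens are built, so the sketch assumes one token per landmark (a linear projection of its 3D coordinates plus a per-joint embedding) and one token for the flattened 3x4 extrinsics matrix. All sizes, parameter names, and the randomly initialised weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_JOINTS, D_MODEL, NUM_LATENTS = 17, 32, 64  # hypothetical sizes

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def make_condition_tokens(landmarks, extrinsics, params):
    """Map (J, 3) landmarks and (3, 4) extrinsics to (J + 1, D) tokens."""
    joint_tokens = landmarks @ params["w_joint"] + params["joint_id"]
    camera_token = extrinsics.reshape(1, 12) @ params["w_cam"]
    return np.concatenate([joint_tokens, camera_token], axis=0)

def cross_attention(latents, tokens, params):
    """Image latents (queries) attend to conditioning tokens (keys, values)."""
    q = latents @ params["w_q"]
    k = tokens @ params["w_k"]
    v = tokens @ params["w_v"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return latents + weights @ v, weights  # residual update of the latents

# Random stand-ins for weights that would be learned in a real model.
params = {
    "w_joint": rng.normal(size=(3, D_MODEL)) * 0.1,
    "joint_id": rng.normal(size=(NUM_JOINTS, D_MODEL)) * 0.1,
    "w_cam": rng.normal(size=(12, D_MODEL)) * 0.1,
    "w_q": rng.normal(size=(D_MODEL, D_MODEL)) * 0.1,
    "w_k": rng.normal(size=(D_MODEL, D_MODEL)) * 0.1,
    "w_v": rng.normal(size=(D_MODEL, D_MODEL)) * 0.1,
}

landmarks = rng.normal(size=(NUM_JOINTS, 3))                          # sparse 3D keypoints
extrinsics = np.hstack([np.eye(3), np.array([[0.0], [0.0], [3.0]])])  # [R | t]
latents = rng.normal(size=(NUM_LATENTS, D_MODEL))                     # flattened image latents

tokens = make_condition_tokens(landmarks, extrinsics, params)
conditioned, attn = cross_attention(latents, tokens, params)
```

Each latent position receives a weighted mix of the landmark and camera tokens, so the 3D coordinates reach the denoiser without first being rasterized into a 2D control image. In a real diffusion model this step would sit inside the denoising network's attention layers and use trained weights.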

Problem

Research questions and friction points this paper is trying to address.

photorealistic human synthesis

3D pose control

camera conditioning

identity preservation

appearance detail

Innovation

Methods, ideas, or system contributions that make the work stand out.

tokenized 3D conditioning

diffusion-based human synthesis

3D body landmarks