PoseCraft: Tokenized 3D Body Landmark and Camera Conditioning for Photorealistic Human Image Synthesis

📅 2026-02-22
🤖 AI Summary
Existing methods for generating photorealistic human images with controlled 3D poses and camera viewpoints often rely on laborious manual rigging or per-pose re-optimization, and struggle to balance efficiency with fine-grained detail preservation. This work proposes PoseCraft, a diffusion-based framework that encodes sparse 3D human landmarks and camera extrinsics as discrete conditioning tokens and injects them into the diffusion process via cross-attention. Keeping the conditioning in 3D avoids the ambiguities of 2D re-projection and preserves 3D semantics under large pose and viewpoint changes. Trained and evaluated with data from the accompanying GenHumanRF generation workflow, which renders supervision from volumetric reconstructions, the method shows significant perceptual quality gains over diffusion-centric baselines and achieves better or comparable metrics to recent state-of-the-art volumetric rendering methods, while better preserving fine details such as fabric and hair.

📝 Abstract
Digitizing humans and synthesizing photorealistic avatars with explicit 3D pose and camera controls are central to VR, telepresence, and entertainment. Existing skinning-based workflows require laborious manual rigging or template-based fitting, while neural volumetric methods rely on canonical templates and re-optimization for each unseen pose. We present PoseCraft, a diffusion framework built around a tokenized 3D interface: instead of relying only on rasterized geometry as 2D control images, we encode sparse 3D landmarks and camera extrinsics as discrete conditioning tokens and inject them into the diffusion model via cross-attention. Our approach preserves 3D semantics by avoiding 2D re-projection ambiguity under large pose and viewpoint changes, and produces photorealistic imagery that faithfully captures identity and appearance. To train and evaluate at scale, we also implement GenHumanRF, a data generation workflow that renders diverse supervision from volumetric reconstructions. Our experiments show that PoseCraft achieves a significant perceptual quality improvement over diffusion-centric methods, and attains better or comparable metrics to the latest state-of-the-art volumetric rendering methods while better preserving fabric and hair details.
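The tokenized 3D interface described in the abstract can be illustrated with a small sketch. The abstract does not specify the tokenization scheme, so everything below is an assumption made for illustration: uniform binning of coordinates into 256 bins over [-1, 1], a 3x4 extrinsic matrix flattened row by row, a fixed sinusoidal stand-in for a learned embedding table, and a single-head attention without learned projections. The names `quantize`, `tokenize_conditioning`, `embed`, and `cross_attention` are hypothetical and do not come from the paper.

```python
import math

def quantize(value, lo=-1.0, hi=1.0, num_bins=256):
    """Map a continuous value to an integer bin index in [0, num_bins - 1] (assumed scheme)."""
    clipped = min(max(value, lo), hi)
    index = int((clipped - lo) / (hi - lo) * num_bins)
    return min(index, num_bins - 1)  # the upper bound hi falls into the last bin

def tokenize_conditioning(landmarks, extrinsics, num_bins=256):
    """Flatten 3D landmarks and a 3x4 extrinsic matrix into one list of discrete token ids."""
    tokens = []
    for point in landmarks:
        tokens.extend(quantize(c, num_bins=num_bins) for c in point)
    for row in extrinsics:
        tokens.extend(quantize(c, num_bins=num_bins) for c in row)
    return tokens

def embed(token_id, dim=4):
    """Hypothetical fixed embedding; a real model would use a learned embedding table."""
    return [math.sin(token_id * 0.01 * (k + 1)) for k in range(dim)]

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention: image features attend to condition tokens.
    Learned query/key/value projections are omitted to keep the sketch short."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))
        ])
    return outputs

# Two toy landmarks and a toy camera pose (identity rotation, small translation).
landmarks = [(0.0, 0.5, -0.5), (1.0, -1.0, 0.25)]
extrinsics = [[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.5]]

tokens = tokenize_conditioning(landmarks, extrinsics)
condition = [embed(t) for t in tokens]

# Stand-ins for two spatial positions of a diffusion model's intermediate features.
image_features = [[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.5]]
attended = cross_attention(image_features, condition, condition)
```

In this sketch each image feature receives a weighted mix of the condition embeddings, so the pose and camera information reaches the denoiser without first being rasterized into a 2D control image.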
Problem

Research questions and friction points this paper is trying to address.

photorealistic human synthesis
3D pose control
camera conditioning
identity preservation
appearance detail
Innovation

Methods, ideas, or system contributions that make the work stand out.

tokenized 3D conditioning
diffusion-based human synthesis
3D body landmarks
camera-aware generation
photorealistic avatar
Zhilin Guo
University of Cambridge
Jing Yang
University of Cambridge
Computer Vision, Face analysis
Kyle Fogarty
University of Cambridge
Geometry Processing, Geometric Deep Learning, Applied Mathematics
Jingyi Wan
University of Cambridge
Boqiao Zhang
University of Cambridge
Tianhao Wu
PhD student, University of Cambridge
computer vision, 3D reconstruction, implicit representation
Weihao Xia
University of Cambridge
Chenliang Zhou
University of Cambridge
machine learning, generative artificial intelligence, computer vision, computer graphics
Sakar Khattar
Google
Fangcheng Zhong
University of Cambridge
Cristina Nader Vasconcelos
Google
Cengiz Oztireli
University of Cambridge, Google