🤖 AI Summary
Existing methods for long-duration (>5 seconds) human motion video generation struggle to simultaneously preserve fine-grained hand and facial detail, adapt to multi-resolution inputs, and maintain cross-frame visual consistency. To address these challenges, the authors propose HumanDiT, a pose-guided Diffusion Transformer with three key components: (1) a keypoint-guided DiT architecture that integrates Keypoint-DiT with a lightweight Pose Adapter for precise pose transfer and image-conditioned generation; (2) a prefix-latent reference mechanism that preserves personalized appearance over extended sequences; and (3) native support for variable-length, dynamic-resolution output. Trained on 14,000 hours of high-quality in-the-wild human video, the model significantly improves pose accuracy and perceptual realism, especially for complex motions and diverse scenes. The authors claim it is the first method to enable single-image- or single-video-driven, high-fidelity, high-resolution, long-duration human motion synthesis with strong inter-frame consistency.
📝 Abstract
Human motion video generation has advanced significantly, yet existing methods still struggle to accurately render detailed body parts such as hands and faces, especially in long sequences and intricate motions. Current approaches also rely on fixed resolutions and struggle to maintain visual consistency. To address these limitations, we propose HumanDiT, a pose-guided Diffusion Transformer (DiT)-based framework trained on a large in-the-wild dataset of 14,000 hours of high-quality video to produce high-fidelity videos with fine-grained body rendering. Specifically, (i) HumanDiT, built on DiT, supports diverse video resolutions and variable sequence lengths, facilitating learning for long-sequence video generation; (ii) we introduce a prefix-latent reference strategy to maintain personalized characteristics across extended sequences. Furthermore, during inference, HumanDiT leverages Keypoint-DiT to generate subsequent pose sequences, enabling video continuation from static images or existing videos, and employs a Pose Adapter to perform pose transfer with given pose sequences. Extensive experiments demonstrate its superior performance in generating long-form, pose-accurate videos across diverse scenarios.
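The prefix-latent reference strategy mentioned above can be illustrated with a minimal sketch. The paper does not specify the exact implementation; the function name, shapes, and the choice to keep the reference latent noise-free while masking it out of the denoising loss are assumptions about how such prefix conditioning is commonly realized in DiT-style video models.

```python
import numpy as np

def build_prefix_latent_sequence(ref_latent, noisy_video_latents):
    """Hypothetical sketch of a prefix-latent reference strategy.

    The clean latent of the reference frame is prepended along the
    temporal axis, so every video token can attend to it through the
    transformer's self-attention, anchoring identity and appearance
    across long sequences. A loss mask excludes the prefix position
    from the denoising objective (assumption, not from the paper).

    ref_latent:          (c, h, w)    clean reference-frame latent
    noisy_video_latents: (t, c, h, w) noised latents being denoised
    """
    # Prepend the reference latent as frame 0 of the token sequence.
    sequence = np.concatenate([ref_latent[None], noisy_video_latents], axis=0)
    # Supervise only the t video frames, not the reference prefix.
    loss_mask = np.concatenate(
        [np.zeros(1, dtype=bool),
         np.ones(noisy_video_latents.shape[0], dtype=bool)]
    )
    return sequence, loss_mask
```

Because the prefix occupies ordinary token positions, this scheme needs no architectural change to the DiT backbone, which is consistent with the abstract's emphasis on variable sequence lengths.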