🤖 AI Summary
Generating high-quality human videos from a single image with view-consistent appearance and natural clothing dynamics remains challenging. This work proposes HVG, a latent video diffusion model that introduces a dual-dimensional bone map to capture the anatomical relationships of 3D joints and integrates 3D pose and viewpoint conditioning to mitigate self-occlusion across views. To ensure multi-view consistency, HVG incorporates a view and temporal alignment mechanism and employs a progressive spatio-temporal sampling strategy to produce temporally smooth, long multi-view animations. Experimental results show that HVG outperforms existing methods in generation quality, viewpoint consistency, and motion naturalness.
📝 Abstract
Recent diffusion methods have made significant progress in generating videos from single images, owing to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment, which maintains smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.