Human Video Generation from a Single Image with 3D Pose and View Control

📅 2026-02-24

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Generating high-quality human videos from a single image, with view-consistent appearance and natural, motion-dependent clothing wrinkles, remains challenging. This work proposes HVG, a latent video diffusion model that uses a dual-dimensional bone map to capture the anatomical relationships of 3D joints and adds 3D pose and view conditioning to resolve self-occlusions across views. For multi-view consistency and frame-to-frame stability, HVG aligns the reference image with the pose sequences across views and time, and it uses a progressive spatio-temporal sampling strategy to keep transitions smooth in long multi-view animations. The authors report that HVG outperforms existing methods at generating high-quality 4D human videos from diverse human images and pose inputs.

📝 Abstract

Recent diffusion methods have made significant progress in generating videos from single images due to their powerful visual generation capabilities. However, challenges persist in image-to-video synthesis, particularly in human video generation, where inferring view-consistent, motion-dependent clothing wrinkles from a single image remains a formidable problem. In this paper, we present Human Video Generation in 4D (HVG), a latent video diffusion model capable of generating high-quality, multi-view, spatiotemporally coherent human videos from a single image with 3D pose and view control. HVG achieves this through three key designs: (i) Articulated Pose Modulation, which captures the anatomical relationships of 3D joints via a novel dual-dimensional bone map and resolves self-occlusions across views by introducing 3D information; (ii) View and Temporal Alignment, which ensures multi-view consistency and alignment between a reference image and pose sequences for frame-to-frame stability; and (iii) Progressive Spatio-Temporal Sampling with temporal alignment to maintain smooth transitions in long multi-view animations. Extensive experiments on image-to-video tasks demonstrate that HVG outperforms existing methods in generating high-quality 4D human videos from diverse human images and pose inputs.
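To illustrate why 3D information helps with self-occlusion across views, here is a minimal Python sketch. It is not HVG's Articulated Pose Modulation or its bone map; the toy two-joint skeleton, the function names `rotate_yaw` and `project_with_depth`, and the orthographic camera are all assumptions made for illustration. It shows only that two joints can collapse onto the same 2D location under a side view while their depths still tell them apart.

```python
import math

def rotate_yaw(joints, yaw_deg):
    """Rotate (x, y, z) joints about the vertical y axis by yaw_deg degrees."""
    t = math.radians(yaw_deg)
    c, s = math.cos(t), math.sin(t)
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in joints]

def project_with_depth(joints):
    """Orthographic projection: (x, y) become image coordinates, z is kept as depth."""
    return [((x, y), z) for x, y, z in joints]

# Hypothetical toy skeleton: two wrists at the same height, on opposite sides of the body.
joints = [(-0.5, 1.0, 0.0), (0.5, 1.0, 0.0)]

front = project_with_depth(rotate_yaw(joints, 0.0))   # wrists are separated in 2D
side = project_with_depth(rotate_yaw(joints, 90.0))   # wrists overlap in 2D

for name, view in (("front", front), ("side", side)):
    for (u, v), depth in view:
        print(f"{name}: image=({u:.2f}, {v:.2f}) depth={depth:.2f}")
```

In the side view both wrists project to roughly the same image point, so a purely 2D skeleton cannot say which limb is in front; the retained depth values differ, which is the kind of cue that 3D pose conditioning makes available.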

Problem

Research questions and friction points this paper is trying to address.

human video generation

single image

view consistency

motion-dependent clothing wrinkles

image-to-video synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

3D pose control

multi-view consistency

latent video diffusion