Multi-identity Human Image Animation with Structural Video Diffusion

📅 2025-04-05

📈 Citations: 0

✨ Influential: 0

career value

234K/year

🤖 AI Summary

Existing single-image-to-video methods for human motion generation perform poorly in multi-person interaction scenarios, primarily due to difficulties in modeling cross-subject appearance-pose consistency and 3D-aware human-object dynamic interactions. This paper introduces the first single-image-driven video generation framework tailored for multi-identity interaction. Our approach comprises three key components: (1) an identity-specific embedding mechanism that ensures appearance stability and pose disentanglement across multiple subjects; (2) a geometry-aware conditioning module integrating depth and surface normals to explicitly model 3D spatial relationships between humans and objects; and (3) a diffusion-based structured temporal modeling architecture. We further release a large-scale multi-identity interaction video dataset containing 25K samples. Extensive experiments demonstrate that our method achieves state-of-the-art performance in visual quality, temporal coherence, and interaction plausibility.

Technology Category

Application Category

📝 Abstract

Generating human videos from a single image while ensuring high visual quality and precise control is a challenging task, especially in complex scenarios involving multiple individuals and interactions with objects. Existing methods, while effective for single-human cases, often fail to handle the intricacies of multi-identity interactions because they struggle to associate the correct pairs of human appearance and pose condition and model the distribution of 3D-aware dynamics. To address these limitations, we present Structural Video Diffusion, a novel framework designed for generating realistic multi-human videos. Our approach introduces two core innovations: identity-specific embeddings to maintain consistent appearances across individuals and a structural learning mechanism that incorporates depth and surface-normal cues to model human-object interactions. Additionally, we expand existing human video dataset with 25K new videos featuring diverse multi-human and object interaction scenarios, providing a robust foundation for training. Experimental results demonstrate that Structural Video Diffusion achieves superior performance in generating lifelike, coherent videos for multiple subjects with dynamic and rich interactions, advancing the state of human-centric video generation.

Problem

Research questions and friction points this paper is trying to address.

Generating high-quality multi-human videos from single images

Handling complex multi-identity interactions and object dynamics

Ensuring consistent appearance-pose alignment in video generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Structural Video Diffusion for multi-human videos

Identity-specific embeddings for appearance consistency

Structural learning with depth and surface-normal cues

🔎 Similar Papers

DiffMesh: A Motion-aware Diffusion Framework for Human Mesh Recovery from Videos