OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-driven portrait animation methods struggle to simultaneously achieve complex motion synthesis, multi-style adaptability, and flexible input conditioning, limiting their scalability for large-scale video generation. This paper introduces the first universal diffusion Transformer framework for full-scale portraits—spanning face, head-and-shoulders, upper-body, and full-body configurations—enabling end-to-end, multimodal-driven, high-fidelity human animation. Our approach integrates joint encoding of audio, reference video, and pose cues; dynamic mask scheduling; and cross-modal alignment distillation. We further propose a large-scale conditional hybrid training paradigm and a dual-principle training mechanism, overcoming the limitations of unimodal audio-only driving and supporting diverse scenarios including speech, singing, and human-object interaction. Experiments demonstrate significant improvements over state-of-the-art methods: +2.1 dB in PSNR and −37% in FID. The framework exhibits strong generalization capability and style extensibility.
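As a rough illustration of what joint multimodal conditioning of a diffusion Transformer can look like, here is a minimal PyTorch sketch (not the authors' code; all module names, feature dimensions, and the fusion scheme are hypothetical assumptions): per-frame audio features, pose features, and reference-image tokens are projected to a shared width and attended to by the video token stream.

```python
import torch
import torch.nn as nn

class MultimodalConditionBlock(nn.Module):
    """Toy fusion block: video tokens cross-attend to concatenated condition tokens."""
    def __init__(self, dim=512, n_heads=8, audio_dim=128, pose_dim=134):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)   # hypothetical audio feature width
        self.pose_proj = nn.Linear(pose_dim, dim)     # hypothetical pose feature width
        self.ref_proj = nn.Linear(dim, dim)           # reference-image latent tokens
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, audio_feats, pose_feats, ref_tokens):
        # Project every conditioning stream to the model width and concatenate
        # along the sequence axis so a single cross-attention sees all modalities.
        cond = torch.cat([
            self.audio_proj(audio_feats),
            self.pose_proj(pose_feats),
            self.ref_proj(ref_tokens),
        ], dim=1)
        x = video_tokens
        x = x + self.cross_attn(self.norm1(x), cond, cond, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(x)

# Toy shapes: 2 clips, 256 video tokens, 50 audio/pose frames, 64 reference tokens.
block = MultimodalConditionBlock()
out = block(torch.randn(2, 256, 512), torch.randn(2, 50, 128),
            torch.randn(2, 50, 134), torch.randn(2, 64, 512))
print(out.shape)  # torch.Size([2, 256, 512])
```

In a real system the condition streams would be temporally aligned with the video tokens and injected at many blocks; this sketch only shows the single-block fusion pattern.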

📝 Abstract
End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in recent years. However, existing methods still struggle to scale up as large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven, and combined driving signals). Video samples are provided on the project page (https://omnihuman-lab.github.io).
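The abstract's "mixing motion-related conditions into the training phase" suggests a per-sample scheduler that decides which available conditions are actually fed to the model, with stronger (more motion-correlated) conditions used at lower ratios so that weakly conditioned data still shapes motion generation. Below is a minimal sketch of that idea; the condition names and keep-ratios are illustrative assumptions, not values from the paper.

```python
import random

# Hypothetical keep-ratios: the more strongly a condition constrains motion,
# the less often it is fed to the model during training.
KEEP_RATIO = {"text": 0.9, "audio": 0.5, "pose": 0.25}

def sample_condition_mask(available, rng=random):
    """For one training sample, decide which of its available conditions to keep."""
    return {
        name: (name in available) and (rng.random() < ratio)
        for name, ratio in KEEP_RATIO.items()
    }

# Example: a clip annotated with all three modalities; pose is often dropped,
# so the model also learns to synthesize body motion from audio alone.
random.seed(0)
print(sample_condition_mask({"text", "audio", "pose"}))
```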
Problem

Research questions and friction points this paper is trying to address.

Voice-Controlled Animation
Large-Scale Video Generation
Complex Body Motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniHuman
Diffusion Transformer
Multimodal Video Generation
👥 Authors
Gaojie Lin (ByteDance)
Jianwen Jiang (ByteDance)
Jiaqi Yang (ByteDance)
Zerong Zheng (ByteDance) · Computer Vision, Computer Graphics
Chao Liang (ByteDance)