OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

📅 2025-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audio-driven portrait animation methods struggle to simultaneously achieve complex motion synthesis, multi-style adaptability, and flexible input conditioning, limiting their scalability for large-scale video generation. This paper introduces the first universal diffusion Transformer framework for full-scale portraits—spanning face, head-and-shoulders, upper-body, and full-body configurations—enabling end-to-end, multimodal-driven, high-fidelity human animation. Our approach integrates joint encoding of audio, reference video, and pose cues; dynamic mask scheduling; and cross-modal alignment distillation. We further propose a large-scale conditional hybrid training paradigm and a dual-principle training mechanism, overcoming the limitations of unimodal audio-only driving and supporting diverse scenarios including speech, singing, and human-object interaction. Experiments demonstrate significant improvements over state-of-the-art methods: +2.1 dB in PSNR and −37% in FID. The framework exhibits strong generalization capability and style extensibility.
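As a rough illustration of what joint multimodal conditioning of a diffusion Transformer can look like, here is a minimal PyTorch sketch (not the authors' code; all module names, feature dimensions, and the fusion scheme are hypothetical assumptions): per-frame audio features, pose features, and reference-image tokens are projected to a shared width and attended to by the video token stream.

```python
import torch
import torch.nn as nn

class MultimodalConditionBlock(nn.Module):
    """Toy fusion block: video tokens cross-attend to concatenated condition tokens."""
    def __init__(self, dim=512, n_heads=8, audio_dim=128, pose_dim=134):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, dim)   # hypothetical audio feature width
        self.pose_proj = nn.Linear(pose_dim, dim)     # hypothetical pose feature width
        self.ref_proj = nn.Linear(dim, dim)           # reference-image latent tokens
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, audio_feats, pose_feats, ref_tokens):
        # Project every conditioning stream to the model width and concatenate
        # along the sequence axis so a single cross-attention sees all modalities.
        cond = torch.cat([
            self.audio_proj(audio_feats),
            self.pose_proj(pose_feats),
            self.ref_proj(ref_tokens),
        ], dim=1)
        x = video_tokens
        x = x + self.cross_attn(self.norm1(x), cond, cond, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(x)

# Toy shapes: 2 clips, 256 video tokens, 50 audio/pose frames, 64 reference tokens.
block = MultimodalConditionBlock()
out = block(torch.randn(2, 256, 512), torch.randn(2, 50, 128),
            torch.randn(2, 50, 134), torch.randn(2, 64, 512))
print(out.shape)  # torch.Size([2, 256, 512])
```

In a real system the condition streams would be temporally aligned with the video tokens and injected at many blocks; this sketch only shows the single-block fusion pattern.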

📝 Abstract
End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in recent years. However, existing methods still struggle to scale up as large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven, and combined driving signals). Video samples are provided on the project page (https://omnihuman-lab.github.io).
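The abstract's "mixing motion-related conditions into the training phase" suggests a per-sample scheduler that decides which available conditions are actually fed to the model, with stronger (more motion-correlated) conditions used at lower ratios so that weakly conditioned data still shapes motion generation. Below is a minimal sketch of that idea; the condition names and keep-ratios are illustrative assumptions, not values from the paper.

```python
import random

# Hypothetical keep-ratios: the more strongly a condition constrains motion,
# the less often it is fed to the model during training.
KEEP_RATIO = {"text": 0.9, "audio": 0.5, "pose": 0.25}

def sample_condition_mask(available, rng=random):
    """For one training sample, decide which of its available conditions to keep."""
    return {
        name: (name in available) and (rng.random() < ratio)
        for name, ratio in KEEP_RATIO.items()
    }

# Example: a clip annotated with all three modalities; pose is often dropped,
# so the model also learns to synthesize body motion from audio alone.
random.seed(0)
print(sample_condition_mask({"text", "audio", "pose"}))
```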
Problem

Research questions and friction points this paper is trying to address.

Voice-Controlled Animation
Large-Scale Video Generation
Complex Body Motion
Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniHuman
Diffusion Transformer
Multimodal Video Generation
👥 Authors
Gaojie Lin (ByteDance)
Jianwen Jiang (ByteDance)
Jiaqi Yang (ByteDance)
Zerong Zheng (ByteDance) · Computer Vision, Computer Graphics
Chao Liang (ByteDance)