🤖 AI Summary
Existing audio-driven video generation methods primarily focus on facial motion, resulting in inconsistent head and body movements. To address this, we propose an end-to-end audio-driven full-body portrait video generation framework that synthesizes high-fidelity lip motion, natural head pose, and nuanced hand gestures from a single reference image and speech input. Our method introduces a novel cascaded Diffusion Transformer (DiT) architecture—the first to jointly model full-body motion dynamics and region-specific detail refinement (for face and hands)—and incorporates region-level 3D human pose estimation as a cross-modal bridging signal. It integrates audio-conditioned modeling, multi-scale spatiotemporal attention, and a cascaded generation paradigm. Extensive experiments demonstrate significant improvements over state-of-the-art methods in lip-sync accuracy, full-body motion coherence, and fidelity of facial and hand details.
📝 Abstract
Despite recent progress in audio-driven video generation, existing methods mostly focus on driving facial movements, leading to incoherent head and body dynamics. Moving forward, it is desirable yet challenging to generate holistic human videos with both accurate lip-sync and delicate co-speech gestures w.r.t. the given audio. In this work, we propose AudCast, a generalized audio-driven human video generation framework adopting a cascaded Diffusion-Transformer (DiT) paradigm, which synthesizes holistic human videos from a reference image and a given audio clip. 1) First, an audio-conditioned Holistic Human DiT architecture is proposed to directly drive the movements of any human body with vivid gesture dynamics. 2) Then, to enhance the hand and face details that are notoriously difficult to handle, a Regional Refinement DiT leverages regional 3D fitting as a bridge to reform the generation signals, producing the final results. Extensive experiments demonstrate that our framework generates high-fidelity audio-driven holistic human videos with temporal coherence and fine facial and hand details. Resources can be found at https://guanjz20.github.io/projects/AudCast.
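The two-stage cascade described above can be sketched in code. This is a minimal, hedged illustration of the data flow only: all class names, method signatures, and the placeholder 3D-fitting step are assumptions for exposition, not the authors' actual API or model internals.

```python
from dataclasses import dataclass

# Illustrative sketch of the AudCast cascade (names are hypothetical).

@dataclass
class Video:
    frames: list  # each frame holds per-region features


class HolisticHumanDiT:
    """Stage 1: audio-conditioned DiT driving full-body motion."""

    def generate(self, reference_image, audio):
        # The real model performs iterative diffusion denoising
        # conditioned on audio tokens; here we just emit coarse frames,
        # one per audio chunk.
        return Video(frames=[
            {"body": f"pose_{t}", "face": "coarse", "hands": "coarse"}
            for t in range(len(audio))
        ])


class RegionalRefinementDiT:
    """Stage 2: refines face and hand regions, using regional 3D
    fitting as a cross-modal bridging signal."""

    def refine(self, video):
        for frame in video.frames:
            fit = self.fit_region_3d(frame)  # regional 3D fitting
            frame["face"] = f"refined({fit['face']})"
            frame["hands"] = f"refined({fit['hands']})"
        return video

    def fit_region_3d(self, frame):
        # Placeholder for fitting 3D face/hand geometry to the coarse frame.
        return {"face": frame["face"], "hands": frame["hands"]}


def audcast_pipeline(reference_image, audio):
    """Cascade: coarse holistic generation, then regional refinement."""
    coarse = HolisticHumanDiT().generate(reference_image, audio)
    return RegionalRefinementDiT().refine(coarse)
```

The key design point mirrored here is the cascade: the first DiT commits to globally coherent body motion before the second DiT spends capacity on the face and hands, the regions where artifacts are most visible.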