🤖 AI Summary
This work addresses the problem of audio-driven talking-head video generation and editing under multimodal conditions, supporting diverse inputs including text, images, and videos, while synthesizing arbitrarily long, high-fidelity, temporally coherent videos. Methodologically, it (1) introduces a hybrid curriculum learning strategy to achieve fine-grained audio–lip motion alignment; (2) incorporates a facial mask loss and an audio-guided classifier-free guidance mechanism to enhance identity preservation and lip-sync accuracy; and (3) designs a sliding-window denoising scheme to model long-range temporal dependencies in latent representations. Built upon a pre-trained video diffusion Transformer, the framework leverages a triplet-based audio–video–text data curation pipeline. Extensive experiments demonstrate significant improvements over state-of-the-art methods in lip-sync precision, identity fidelity, and facial motion naturalness, particularly in challenging scenarios involving complex speech and cross-identity, long-duration video synthesis.
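The audio-guided classifier-free guidance mentioned in point (2) can be illustrated with the standard multi-condition CFG decomposition, where an extra guidance term pulls the prediction toward the audio condition on top of the text condition. This is a minimal sketch of that decomposition, not the paper's exact formulation; the weights `w_text` and `w_audio` are hypothetical names.

```python
def audio_guided_cfg(eps_uncond, eps_text, eps_text_audio,
                     w_text=5.0, w_audio=3.0):
    """Compose a guided noise prediction from three model passes:
    unconditional, text-only, and text+audio (scalar stand-ins here;
    in practice these are noise-prediction tensors).

    The text term steers toward the prompt; the audio term adds an
    extra push from the audio condition, enabling separate control
    over prompt adherence and lip-sync strength.
    """
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_audio * (eps_text_audio - eps_text))
```

In practice each argument would be the denoiser's output for the corresponding conditioning dropout pattern, applied elementwise at every sampling step.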
📝 Abstract
The generation and editing of audio-conditioned talking portraits guided by multimodal inputs, including text, images, and videos, remain underexplored. In this paper, we present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Built upon pretrained video diffusion transformers, our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs. We employ a hybrid curriculum learning strategy to progressively align audio with facial motion, enabling fine-grained multimodal control over long video sequences. To enhance local facial coherence, we introduce a facial mask loss and an audio-guided classifier-free guidance mechanism. A sliding-window denoising approach further fuses latent representations across temporal segments, ensuring visual fidelity and temporal consistency across extended durations and diverse identities. More importantly, we construct a dedicated data pipeline for curating high-quality triplets consisting of synchronized audio, video, and textual descriptions. Comprehensive benchmark evaluations show that SkyReels-Audio achieves superior performance in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions.
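The sliding-window denoising described above can be sketched as denoising a long latent sequence in overlapping temporal windows and linearly blending the overlaps, so adjacent segments agree where they meet. This is an illustrative sketch under assumed interfaces (a per-window `denoise_fn` and per-frame scalar latents), not the paper's implementation.

```python
def sliding_window_denoise(latents, denoise_fn, window=16, overlap=4):
    """Denoise a long sequence of per-frame latents in overlapping
    windows, blending overlapping frames with a linear ramp so the
    fused result stays temporally consistent across segment seams.

    latents    -- list of per-frame latent values (floats here for
                  simplicity; real latents are tensors)
    denoise_fn -- callable that denoises one window (a sub-list)
    """
    T = len(latents)
    out = [0.0] * T        # weighted sum of window outputs per frame
    weight = [0.0] * T     # accumulated blend weights per frame
    start, step = 0, window - overlap
    while start < T:
        end = min(start + window, T)
        chunk = denoise_fn(latents[start:end])
        for i, v in enumerate(chunk):
            # Ramp weight up over the overlap region of every window
            # after the first, so earlier and later windows cross-fade.
            w = (i + 1) / (overlap + 1) if (start > 0 and i < overlap) else 1.0
            out[start + i] += w * v
            weight[start + i] += w
        if end == T:
            break
        start += step
    # Normalize by total weight to get the fused denoised sequence.
    return [o / wgt for o, wgt in zip(out, weight)]
```

Because every frame's output is a convex combination of the window predictions covering it, an identity `denoise_fn` returns the input unchanged, which makes the blending easy to sanity-check.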