SkyReels-Audio: Omni Audio-Conditioned Talking Portraits in Video Diffusion Transformers

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the problem of audio-driven talking-head video generation and editing under multimodal conditions, supporting diverse inputs including text, images, and videos, while synthesizing arbitrarily long, high-fidelity, temporally coherent videos. Methodologically, it (1) introduces a novel hybrid curriculum learning strategy to achieve fine-grained audio–lip motion alignment; (2) incorporates a facial mask loss and an audio-guided classifier-free guidance mechanism to enhance identity preservation and lip-sync accuracy; and (3) designs a sliding-window denoising scheme to model long-range temporal dependencies in latent representations. Built upon a pre-trained video diffusion Transformer, the framework leverages a triplet-based audio–video–text data curation pipeline. Extensive experiments demonstrate significant improvements over state-of-the-art methods in lip-sync precision, identity fidelity, and facial motion naturalness—particularly for challenging scenarios involving complex speech and cross-identity, long-duration video synthesis.

📝 Abstract
The generation and editing of audio-conditioned talking portraits guided by multimodal inputs, including text, images, and videos, remain underexplored. In this paper, we present SkyReels-Audio, a unified framework for synthesizing high-fidelity and temporally coherent talking portrait videos. Built upon pretrained video diffusion transformers, our framework supports infinite-length generation and editing, while enabling diverse and controllable conditioning through multimodal inputs. We employ a hybrid curriculum learning strategy to progressively align audio with facial motion, enabling fine-grained multimodal control over long video sequences. To enhance local facial coherence, we introduce a facial mask loss and an audio-guided classifier-free guidance mechanism. A sliding-window denoising approach further fuses latent representations across temporal segments, ensuring visual fidelity and temporal consistency across extended durations and diverse identities. More importantly, we construct a dedicated data pipeline for curating high-quality triplets consisting of synchronized audio, video, and textual descriptions. Comprehensive benchmark evaluations show that SkyReels-Audio achieves superior performance in lip-sync accuracy, identity consistency, and realistic facial dynamics, particularly under complex and challenging conditions.
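The abstract mentions an audio-guided classifier-free guidance mechanism. The paper does not publish the exact formulation here, but multi-condition classifier-free guidance is commonly composed as a chain of guidance terms; the following is a minimal sketch under that assumption, with illustrative weights (`w_text`, `w_audio`) and a hypothetical function name:

```python
import numpy as np

def audio_guided_cfg(eps_uncond, eps_text, eps_audio, w_text=5.0, w_audio=3.0):
    """Hypothetical multi-condition classifier-free guidance.

    eps_uncond: noise prediction with all conditions dropped
    eps_text:   noise prediction conditioned on text only
    eps_audio:  noise prediction conditioned on text + audio
    The audio term steers the sample toward audio-consistent lip motion
    on top of the text guidance (weights are illustrative, not the paper's).
    """
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_audio * (eps_audio - eps_text))
```

With `w_audio = 0` this reduces to standard text-only classifier-free guidance, which is why this compositional form is a common starting point.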
Problem

Research questions and friction points this paper is trying to address.

Generating audio-conditioned talking portrait videos
Achieving high-fidelity and temporal coherence
Enabling multimodal control with text, images, videos
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid curriculum learning aligns audio with facial motion
Facial mask loss enhances local facial coherence
Sliding-window denoising ensures visual fidelity and consistency
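The sliding-window denoising idea above can be sketched as follows. This is not the paper's implementation; it assumes the common pattern of denoising overlapping temporal windows of the latent sequence and averaging predictions in the overlap regions, with a hypothetical `denoise_fn` standing in for one step of the diffusion model:

```python
import numpy as np

def sliding_window_denoise(latents, denoise_fn, window=16, overlap=4):
    """Denoise a long latent sequence in overlapping temporal windows.

    latents: (T, C) array of noisy video latents along the time axis.
    denoise_fn: callable applied to one (window, C) segment.
    Overlapping predictions are averaged, which smooths seams between
    segments and keeps long sequences temporally consistent (sketch).
    """
    T, C = latents.shape
    out = np.zeros((T, C))
    count = np.zeros((T, 1))
    stride = window - overlap
    start = 0
    while True:
        end = min(start + window, T)
        out[start:end] += denoise_fn(latents[start:end])  # per-window prediction
        count[start:end] += 1                             # overlap bookkeeping
        if end == T:
            break
        start += stride
    return out / count  # average where windows overlap
```

In a real pipeline this fusion would run inside each denoising step rather than once at the end, so that neighboring windows stay consistent throughout sampling.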
👥 Authors
Zhengcong Fei (ICT, UCAS): MLLM, diffusion models
Hao Jiang (SkyReels Team, Skywork AI)
Di Qiu (SkyReels Team, Skywork AI)
Baoxuan Gu (Beijing University of Posts and Telecommunications)
Youqiang Zhang (SkyReels Team, Skywork AI)
Jiahua Wang (SkyReels Team, Skywork AI)
Jialin Bai (SkyReels Team, Skywork AI)
Debang Li (NLPR): Deep Learning, Computer Vision
Mingyuan Fan (Kunlun Inc): AIGC, Semantic Segmentation
Guibin Chen (Skywork AI): Video Generative Models, Reinforcement Learning, Game AI
Yahui Zhou (SkyReels Team, Skywork AI)