SkyReels-A1: Expressive Portrait Animation in Video Diffusion Transformers

📅 2025-02-15
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing approaches to avatar animation suffer from identity distortion, background jitter, unnatural facial dynamics, and poor body-proportion adaptation. This paper introduces SkyReels-A1, a novel high-fidelity avatar animation framework built upon the Video Diffusion Transformer (Video DiT). It incorporates an expression-aware conditioning module for fine-grained motion control; a facial image–text alignment mechanism that jointly encodes identity features and action semantics; and a multi-stage progressive training paradigm that jointly optimizes identity stability and expression authenticity. Extensive evaluations demonstrate significant improvements in visual coherence, identity fidelity, and temporal consistency—both on avatar-specific benchmarks and across diverse body morphologies. The method is validated in practical applications including virtual avatars, remote telepresence, and digital media production, establishing new state-of-the-art performance.

Technology Category

Application Category

📝 Abstract
We present SkyReels-A1, a simple yet effective framework built upon video diffusion Transformer to facilitate portrait image animation. Existing methodologies still encounter issues, including identity distortion, background instability, and unrealistic facial dynamics, particularly in head-only animation scenarios. Besides, extending to accommodate diverse body proportions usually leads to visual inconsistencies or unnatural articulations. To address these challenges, SkyReels-A1 capitalizes on the strong generative capabilities of video DiT, enhancing facial motion transfer precision, identity retention, and temporal coherence. The system incorporates an expression-aware conditioning module that enables seamless video synthesis driven by expression-guided landmark inputs. Integrating the facial image-text alignment module strengthens the fusion of facial attributes with motion trajectories, reinforcing identity preservation. Additionally, SkyReels-A1 incorporates a multi-stage training paradigm to incrementally refine the correlation between expressions and motion while ensuring stable identity reproduction. Extensive empirical evaluations highlight the model's ability to produce visually coherent and compositionally diverse results, making it highly applicable to domains such as virtual avatars, remote communication, and digital media generation.
Problem

Research questions and friction points this paper is trying to address.

Addresses identity distortion in portrait animation
Enhances facial motion transfer precision
Improves temporal coherence in video synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Video Diffusion Transformer
Expression-aware Conditioning Module
Facial Image-Text Alignment
🔎 Similar Papers
No similar papers found.
D
Di Qiu
Skywork AI
Zhengcong Fei
Zhengcong Fei
ICT, UCAS
MLLMdiffusion models
R
Rui Wang
Skywork AI
J
Jialin Bai
Skywork AI
C
Changqian Yu
Skywork AI
Mingyuan Fan
Mingyuan Fan
Kunlun Inc
AIGC Semantic Segmentation
Guibin Chen
Guibin Chen
Skywork AI
Video Generative modelsReinforcement LearningGame AI
X
Xiang Wen
Skywork AI