SkyReels-V3 Technique Report

📅 2026-01-24
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work proposes a unified multimodal video generation framework based on diffusion Transformers, designed to simultaneously achieve high visual fidelity, subject consistency, temporal coherence, and multimodal alignment. The framework supports three core tasks: reference-image-to-video generation, video extension, and audio-driven talking-head synthesis. High-quality training data is constructed through cross-frame pairing, image editing, and semantic rewriting, while a novel mechanism incorporating first-and-last frame insertion and keyframe-based inference enhances generation control. By integrating image-video hybrid training, multi-resolution joint optimization, and explicit spatiotemporal consistency modeling, the model achieves state-of-the-art or near state-of-the-art performance across visual quality, instruction following, and task-specific metrics, rivaling leading closed-source systems.
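To make the in-context conditioning idea concrete, the sketch below shows one common way such unified conditioning is wired into a diffusion Transformer: clean reference-image tokens are concatenated with noisy video-latent tokens along the sequence axis, so plain self-attention carries identity information into the generated frames. This is a minimal illustration under assumed shapes and module names (InContextDiT, token dimensions are placeholders), not the paper's actual implementation.

```python
# Minimal sketch of multimodal in-context conditioning in a diffusion
# Transformer. Reference-image tokens are concatenated with noisy
# video-latent tokens so self-attention alone propagates identity cues.
# All module names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class InContextDiT(nn.Module):
    def __init__(self, dim=512, heads=8, depth=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True, norm_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.out = nn.Linear(dim, dim)  # per-token denoising prediction

    def forward(self, video_tokens, ref_tokens):
        # video_tokens: (B, N_vid, dim) noisy latent patches of the clip
        # ref_tokens:   (B, N_ref, dim) clean latent patches of reference images
        n_vid = video_tokens.shape[1]
        seq = torch.cat([ref_tokens, video_tokens], dim=1)  # in-context concat
        seq = self.blocks(seq)
        # Only the video positions carry a denoising target; the reference
        # tokens act purely as conditioning context.
        return self.out(seq[:, -n_vid:])

model = InContextDiT()
ref = torch.randn(2, 64, 512)   # tokens from one reference image (assumed)
vid = torch.randn(2, 256, 512)  # tokens from a short latent clip (assumed)
print(model(vid, ref).shape)    # torch.Size([2, 256, 512])
```

The same interface extends naturally to the other paradigms: audio or preceding-clip tokens can occupy the context slots in place of (or alongside) reference-image tokens.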

📝 Abstract
Video generation serves as a cornerstone for building world models, and multimodal contextual inference stands as the defining test of capability. To this end, we present SkyReels-V3, a conditional video generation model built upon a unified multimodal in-context learning framework with diffusion Transformers. SkyReels-V3 supports three core generative paradigms within a single architecture: reference-images-to-video synthesis, video-to-video extension, and audio-guided video generation. (i) The reference-images-to-video model is designed to produce high-fidelity videos with strong subject-identity preservation, temporal coherence, and narrative consistency. To enhance reference adherence and compositional stability, we design a comprehensive data-processing pipeline that leverages cross-frame pairing, image editing, and semantic rewriting, effectively mitigating copy-paste artifacts. During training, an image-video hybrid strategy combined with multi-resolution joint optimization improves generalization and robustness across diverse scenarios. (ii) The video extension model integrates spatio-temporal consistency modeling with large-scale video understanding, enabling both seamless single-shot continuation and intelligent multi-shot switching that follows professional cinematographic patterns. (iii) The talking-avatar model supports minute-level audio-conditioned video generation by training on first-and-last-frame insertion patterns and reconstructing the keyframe inference paradigm. While preserving visual quality, audio-video synchronization is further optimized. Extensive evaluations demonstrate that SkyReels-V3 achieves state-of-the-art or near-state-of-the-art performance on key metrics, including visual quality, instruction following, and task-specific metrics, approaching leading closed-source systems. Github: https://github.com/SkyworkAI/SkyReels-V3.
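As a rough illustration of the keyframe-based inference described for the talking-avatar model, the sketch below generates sparse anchor frames first and then fills each span with a generator conditioned on its first and last frames, which is what lets the method scale to minute-level clips. The generate_keyframes and generate_segment functions are hypothetical stubs (the segment filler here is a naive linear blend); only the control flow reflects the stated idea.

```python
# Hedged sketch of keyframe-based long-form inference with
# first-and-last-frame conditioning. Stubs stand in for the real models.
import numpy as np

FPS = 24
KEY_STRIDE = 96  # assumed: one anchor frame every 4 seconds at 24 fps

def generate_keyframes(audio_feats, first_frame):
    """Stub: produce sparse anchor frames, one per KEY_STRIDE video frames."""
    n_frames = len(audio_feats)
    idx = list(range(0, n_frames, KEY_STRIDE)) + [n_frames - 1]
    return {i: first_frame.copy() for i in idx}  # placeholder content

def generate_segment(first, last, audio_chunk):
    """Stub for the first-and-last-frame-conditioned generator: fills the
    frames between two anchors (here, a naive linear blend)."""
    n = len(audio_chunk)
    ts = np.linspace(0.0, 1.0, n + 2)[1:-1]
    return [(1 - t) * first + t * last for t in ts]

def talking_video(audio_feats, first_frame):
    keys = generate_keyframes(audio_feats, first_frame)
    anchors = sorted(keys)
    frames = []
    for a, b in zip(anchors[:-1], anchors[1:]):
        frames.append(keys[a])
        frames.extend(generate_segment(keys[a], keys[b], audio_feats[a + 1:b]))
    frames.append(keys[anchors[-1]])
    return frames

audio = np.random.randn(FPS * 60, 128)  # one minute of per-frame audio features
start = np.zeros((256, 256, 3))
video = talking_video(audio, start)
print(len(video))  # 1440 frames
```

Because every span is anchored at both ends, errors cannot drift across segment boundaries, which is the practical benefit of the first-and-last-frame formulation over purely autoregressive continuation.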
Problem

Research questions and friction points this paper is trying to address.

conditional video generation
multimodal in-context learning
reference images-to-video synthesis
video-to-video extension
audio-guided video generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

diffusion Transformers
multimodal in-context learning
reference-guided video generation
audio-conditioned video synthesis
spatio-temporal consistency