Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

📅 2025-12-15
🤖 AI Summary
This work addresses three core challenges in joint audio-visual generation: cross-modal synchronization, narrative coherence, and generation fidelity. It presents Seedance 1.5 pro, a foundation model for native joint audio-video generation. The model uses a dual-branch Diffusion Transformer architecture with a cross-modal joint module and a specialized multi-stage data pipeline. Post-training combines supervised fine-tuning (SFT) on high-quality datasets with reinforcement learning from human feedback (RLHF) guided by multi-dimensional reward models. The model supports precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence. A separate acceleration framework speeds up inference by over 10×. The model is available on Volcano Engine (VolcEngine) for professional content creation.

📝 Abstract
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.
Problem

Research questions and friction points this paper is trying to address.

Develops a native audio-visual joint generation foundation model
Achieves superior synchronization and quality via dual-branch architecture
Enhances practical utility with optimizations and accelerated inference
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch Diffusion Transformer for joint audio-video generation
Multi-stage data pipeline with cross-modal module for synchronization
Post-training optimizations including SFT and RLHF for quality
Siyan Chen
ByteDance Seed
Yanfei Chen
Google
medical device, finite element analysis, large language models
Ying Chen
ByteDance Seed
Zhuo Chen
ByteDance Seed
Feng Cheng
ByteDance Seed
Xuyan Chi
ByteDance Seed
Jian Cong
ByteDance Seed
speech
Qinpeng Cui
ByteDance Seed
Qide Dong
ByteDance Seed
Junliang Fan
ByteDance Seed
Jing Fang
Northwestern Polytechnical University
Image Processing, Deep Learning
Zetao Fang
ByteDance Seed
Chengjian Feng
Meituan
Computer Vision, Object Detection
Han Feng
ByteDance Seed
Mingyuan Gao
Professor, Institute of Chemistry, Chinese Academy of Sciences
Yu Gao
ByteDance Seed
Qiushan Guo
The University of Hong Kong; ByteDance
Deep Learning, Computer Vision
Boyang Hao
ByteDance Seed
Qingkai Hao
ByteDance Seed
Bibo He
ByteDance Seed
Qian He
ByteDance
Tuyen Hoang
ByteDance Seed
Ruoqing Hu
ByteDance Seed
Xi Hu
ByteDance Seed
Weilin Huang
ByteDance Seed
Computer Vision, Deep Learning