Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model

📅 2025-12-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses three core challenges in joint audio-visual generation: difficult cross-modal synchronization, weak narrative coherence, and low generation fidelity. It presents Seedance 1.5 pro, a foundation model for native, joint audio-video generation. The model uses a dual-branch Diffusion Transformer architecture with a cross-modal joint module, supported by a specialized multi-stage data pipeline. Post-training combines supervised fine-tuning (SFT) on high-quality datasets with reinforcement learning from human feedback (RLHF) guided by multi-dimensional reward models. The model supports precise multilingual and dialect lip-sync, dynamic cinematic camera control, and enhanced narrative coherence. A separate acceleration framework boosts inference speed by over 10×. The model is accessible on Volcano Engine and is aimed at professional-grade content creation.

📝 Abstract

Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by over 10X. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.

Problem

Research questions and friction points this paper is trying to address.

How to build a foundation model for native, joint audio-video generation

How to achieve precise audio-visual synchronization and high generation quality

How to make such a model practical through post-training optimization and faster inference

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-branch Diffusion Transformer for joint audio-video generation

Cross-modal joint module paired with a specialized multi-stage data pipeline for audio-visual synchronization

Post-training optimizations including SFT and RLHF for quality
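
To make the dual-branch idea above more concrete, here is a minimal pure-Python sketch of one way two token streams could exchange information. It is an illustration under stated assumptions, not the paper's design: the abstract does not describe the internals of the cross-modal joint module, so the function names (`cross_attend`, `joint_block`), the use of plain scaled dot-product cross-attention without learned projections, and the toy token values are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Each query token attends over all context tokens.

    Assumption: keys and values are the raw context tokens. A real
    model would apply learned query/key/value projections first.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        outputs.append(mixed)
    return outputs


def joint_block(video_tokens, audio_tokens):
    """Hypothetical fusion step: each branch receives a residual
    update computed by attending to the other branch."""
    video_update = cross_attend(video_tokens, audio_tokens)
    audio_update = cross_attend(audio_tokens, video_tokens)
    new_video = [[x + u for x, u in zip(tok, upd)]
                 for tok, upd in zip(video_tokens, video_update)]
    new_audio = [[x + u for x, u in zip(tok, upd)]
                 for tok, upd in zip(audio_tokens, audio_update)]
    return new_video, new_audio


# Toy inputs: 3 video tokens and 2 audio tokens, each of width 2.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0]]
new_video, new_audio = joint_block(video, audio)
```

Each branch keeps its own sequence length, so the video and audio streams can run at different temporal resolutions while still conditioning on each other. That property is the main reason a two-branch layout with an explicit fusion step is a natural fit for synchronized generation.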
