🤖 AI Summary
This work addresses three core challenges in joint audio-visual generation: difficult cross-modal synchronization, weak narrative coherence, and low generation fidelity. We propose Seedance 1.5 pro, a foundation model for native joint audio-video generation. Methodologically, we design a dual-branch Diffusion Transformer architecture augmented with a cross-modal joint module, and combine it with multi-stage, alignment-aware data curation, supervised fine-tuning (SFT) on high-quality datasets, and reinforcement learning from human feedback (RLHF) guided by multi-dimensional reward models. The model supports precise lip-sync across multiple languages and dialects, dynamic cinematic camera control, and strong narrative coherence. A custom inference acceleration framework speeds up inference by more than 10×. The model is deployed on Volcano Engine and is available for professional content creation.
📝 Abstract
Recent strides in video generation have paved the way for unified audio-visual generation. In this work, we present Seedance 1.5 pro, a foundational model engineered specifically for native, joint audio-video generation. Leveraging a dual-branch Diffusion Transformer architecture, the model integrates a cross-modal joint module with a specialized multi-stage data pipeline, achieving exceptional audio-visual synchronization and superior generation quality. To ensure practical utility, we implement meticulous post-training optimizations, including Supervised Fine-Tuning (SFT) on high-quality datasets and Reinforcement Learning from Human Feedback (RLHF) with multi-dimensional reward models. Furthermore, we introduce an acceleration framework that boosts inference speed by more than 10×. Seedance 1.5 pro distinguishes itself through precise multilingual and dialect lip-syncing, dynamic cinematic camera control, and enhanced narrative coherence, positioning it as a robust engine for professional-grade content creation. Seedance 1.5 pro is now accessible on Volcano Engine at https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?type=GenVideo.