Apollo: Unified Multi-Task Audio-Video Joint Generation

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in audio-visual joint generation—namely, audio-visual asynchrony, poor lip-sync alignment, and performance degradation in individual modalities—by proposing a unified single-tower DiT architecture. The approach integrates an Omni-Full Attention mechanism with a multi-task random modality masking strategy and employs a multi-stage curriculum training pipeline. Additionally, the authors introduce the first large-scale audio-visual-text triplet dataset featuring dense caption annotations. The proposed method substantially outperforms existing models across multiple tasks, achieving performance comparable to Veo 3 while demonstrating strong generalization and cross-modal temporal-semantic consistency.
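As a rough illustration of the single-tower idea, the sketch below (PyTorch) shows one way an Omni-Full-Attention-style block could let audio and video tokens attend to each other inside a shared DiT-like block. The class name, dimensions, and layer layout are illustrative assumptions based on the summary, not the authors' implementation.

```python
# A minimal sketch of an Omni-Full-Attention-style block, assuming it means a
# single full self-attention over the concatenated audio and video token
# sequences inside one DiT-like block. Names and shapes are assumptions.
import torch
import torch.nn as nn


class OmniFullAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # Concatenate modalities along the sequence axis: (B, Tv + Ta, D),
        # so every audio token can attend to every video token and vice versa.
        x = torch.cat([video_tokens, audio_tokens], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # full cross-modal self-attention
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Split back into per-modality streams for the next block.
        t_v = video_tokens.shape[1]
        return x[:, :t_v], x[:, t_v:]


# Example: 16 video patch tokens and 32 audio frame tokens, batch of 2.
block = OmniFullAttentionBlock()
v_out, a_out = block(torch.randn(2, 16, 512), torch.randn(2, 32, 512))
print(v_out.shape, a_out.shape)
```

Keeping a single attention over the joint sequence (rather than separate per-modality towers with cross-attention) is one plausible reading of how the design achieves tight audio-visual alignment in a single tower.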

📝 Abstract
Audio-video joint generation has progressed rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Apollo and delve into three axes--model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. Training-wise, we employ a progressive multi-task regime--random modality masking for joint optimization across tasks, and a multi-stage curriculum--yielding robust representations, strengthening A-V aligned world knowledge, and preventing unimodal collapse. For datasets, we present the first large-scale audio-video dataset with dense captions and introduce a novel automated data-construction pipeline that annotates and filters millions of diverse, high-quality, strictly aligned audio-video-caption triplets. Building on this, Apollo scales to large datasets, delivering high-fidelity, semantically and temporally aligned, instruction-following generation in both joint and unimodal settings while generalizing robustly to out-of-distribution scenarios. Across tasks, it outperforms prior methods by a large margin and achieves performance comparable to Veo 3, offering a unified, scalable path toward next-generation audio-video synthesis.
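The abstract's random modality masking can be pictured as a small sampling routine that decides, per training example, which modality is dropped so a single model is jointly optimized across video-only, audio-only, and joint audio-video tasks. The task split, probabilities, and zero-masking choice below are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of multi-task random modality masking: per example, one
# modality may be masked out so the same model trains on joint, video-only,
# or audio-only generation. Probabilities and masking scheme are assumptions.
import random
import torch


def sample_training_task(video_latents: torch.Tensor, audio_latents: torch.Tensor,
                         p_joint: float = 0.5, p_video_only: float = 0.25):
    """Randomly mask one modality so one model is optimized across tasks."""
    u = random.random()
    if u < p_joint:
        # Joint generation: both modalities are denoised together.
        return {"video": video_latents, "audio": audio_latents, "task": "joint_av"}
    elif u < p_joint + p_video_only:
        # Audio masked out: the step trains the video-generation pathway only.
        return {"video": video_latents,
                "audio": torch.zeros_like(audio_latents),
                "task": "video_only"}
    else:
        # Video masked out: the step trains the audio-generation pathway only.
        return {"video": torch.zeros_like(video_latents),
                "audio": audio_latents,
                "task": "audio_only"}


batch = sample_training_task(torch.randn(2, 16, 512), torch.randn(2, 32, 512))
print(batch["task"])
```

Alternating tasks this way is one plausible mechanism for the claimed benefit of preventing unimodal collapse: each modality's generator keeps receiving dedicated gradient signal even while the joint task dominates.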
Problem

Research questions and friction points this paper is trying to address.

audio-video joint generation
audio-visual asynchrony
lip-speech alignment
unimodal degradation
dense-caption data
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified multi-task generation
audio-video alignment
single-tower DiT architecture
progressive multitask training
dense-captioned AV dataset
Authors
Jun Wang (Kling Team, Kuaishou Technology)
Chunyu Qiang (Kuaishou Technology; TJU; CASIA)
Yuxin Guo (Kling Team, Kuaishou Technology)
Yiran Wang (Kling Team, Kuaishou Technology)
Xijuan Zeng (Kling Team, Kuaishou Technology)
Chen Zhang (Kling Team, Kuaishou Technology)
Pengfei Wan (Head of Kling Video Generation Models, Kuaishou Technology)