Apollo: Unified Multi-Task Audio-Video Joint Generation

📅 2026-01-07
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses key challenges in audio-visual joint generation—namely, audio-visual asynchrony, poor lip-sync alignment, and performance degradation in individual modalities—by proposing a unified single-tower DiT architecture. The approach integrates an Omni-Full Attention mechanism with a multi-task random modality masking strategy and employs a multi-stage curriculum training pipeline. Additionally, the authors introduce the first large-scale audio-visual-text triplet dataset featuring dense caption annotations. The proposed method substantially outperforms existing models across multiple tasks, achieving performance comparable to Veo 3 while demonstrating strong generalization and cross-modal temporal-semantic consistency.
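As a rough illustration of the single-tower idea, the sketch below (PyTorch) shows one way an Omni-Full-Attention-style block could let audio and video tokens attend to each other inside a shared DiT-like block. The class name, dimensions, and layer layout are illustrative assumptions based on the summary, not the authors' implementation.

```python
# A minimal sketch of an Omni-Full-Attention-style block, assuming it means a
# single full self-attention over the concatenated audio and video token
# sequences inside one DiT-like block. Names and shapes are assumptions.
import torch
import torch.nn as nn


class OmniFullAttentionBlock(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # Concatenate modalities along the sequence axis: (B, Tv + Ta, D),
        # so every audio token can attend to every video token and vice versa.
        x = torch.cat([video_tokens, audio_tokens], dim=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)  # full cross-modal self-attention
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Split back into per-modality streams for the next block.
        t_v = video_tokens.shape[1]
        return x[:, :t_v], x[:, t_v:]


# Example: 16 video patch tokens and 32 audio frame tokens, batch of 2.
block = OmniFullAttentionBlock()
v_out, a_out = block(torch.randn(2, 16, 512), torch.randn(2, 32, 512))
print(v_out.shape, a_out.shape)
```

Keeping a single attention over the joint sequence (rather than separate per-modality towers with cross-attention) is one plausible reading of how the design achieves tight audio-visual alignment in a single tower.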

📝 Abstract
Audio-video joint generation has progressed rapidly, yet substantial challenges remain. Non-commercial approaches still suffer from audio-visual asynchrony, poor lip-speech alignment, and unimodal degradation, which stem from weak audio-visual correspondence modeling, limited generalization, and scarce high-quality dense-caption data. To address these issues, we introduce Apollo and delve into three axes--model architecture, training strategy, and data curation. Architecturally, we adopt a single-tower design with unified DiT blocks and an Omni-Full Attention mechanism, achieving tight audio-visual alignment and strong scalability. Training-wise, we employ a progressive multi-task regime--random modality masking for joint optimization across tasks, and a multi-stage curriculum--yielding robust representations, strengthening A-V aligned world knowledge, and preventing unimodal collapse. For datasets, we present the first large-scale audio-video dataset with dense captions and introduce a novel automated data-construction pipeline that annotates and filters millions of diverse, high-quality, strictly aligned audio-video-caption triplets. Building on this, Apollo scales to large datasets, delivering high-fidelity, semantically and temporally aligned, instruction-following generation in both joint and unimodal settings while generalizing robustly to out-of-distribution scenarios. Across tasks, it outperforms prior methods by a large margin and achieves performance comparable to Veo 3, offering a unified, scalable path toward next-generation audio-video synthesis.
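The abstract's random modality masking can be pictured as a small sampling routine that decides, per training example, which modality is dropped so a single model is jointly optimized across video-only, audio-only, and joint audio-video tasks. The task split, probabilities, and zero-masking choice below are illustrative assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of multi-task random modality masking: per example, one
# modality may be masked out so the same model trains on joint, video-only,
# or audio-only generation. Probabilities and masking scheme are assumptions.
import random
import torch


def sample_training_task(video_latents: torch.Tensor, audio_latents: torch.Tensor,
                         p_joint: float = 0.5, p_video_only: float = 0.25):
    """Randomly mask one modality so one model is optimized across tasks."""
    u = random.random()
    if u < p_joint:
        # Joint generation: both modalities are denoised together.
        return {"video": video_latents, "audio": audio_latents, "task": "joint_av"}
    elif u < p_joint + p_video_only:
        # Audio masked out: the step trains the video-generation pathway only.
        return {"video": video_latents,
                "audio": torch.zeros_like(audio_latents),
                "task": "video_only"}
    else:
        # Video masked out: the step trains the audio-generation pathway only.
        return {"video": torch.zeros_like(video_latents),
                "audio": audio_latents,
                "task": "audio_only"}


batch = sample_training_task(torch.randn(2, 16, 512), torch.randn(2, 32, 512))
print(batch["task"])
```

Alternating tasks this way is one plausible mechanism for the claimed benefit of preventing unimodal collapse: each modality's generator keeps receiving dedicated gradient signal even while the joint task dominates.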
Problem

Research questions and friction points this paper is trying to address.

audio-video joint generation
audio-visual asynchrony
lip-speech alignment
unimodal degradation
dense-caption data
Innovation

Methods, ideas, or system contributions that make the work stand out.

unified multi-task generation
audio-video alignment
single-tower DiT architecture
progressive multitask training
dense-captioned AV dataset
Authors
Jun Wang (Kling Team, Kuaishou Technology)
Chunyu Qiang (Kuaishou Technology; TJU; CASIA)
Yuxin Guo (Kling Team, Kuaishou Technology)
Yiran Wang (Kling Team, Kuaishou Technology)
Xijuan Zeng (Kling Team, Kuaishou Technology)
Chen Zhang (Kling Team, Kuaishou Technology)
Pengfei Wan (Head of Kling Video Generation Models, Kuaishou Technology)