🤖 AI Summary
This work addresses key challenges in joint audio-visual generation with dual-stream Transformers: unstable cross-modal interactions, multi-modal background bias, and inconsistency in classifier-free guidance (CFG) between training and inference. To this end, the authors propose a Cross-Modal Context Learning (CCL) framework that combines temporally aligned Rotary Position Embeddings (RoPE) with a partitioning mechanism, learnable context tokens, dynamic context routing, and unconditional context guidance (UCG). The framework establishes, for the first time, a dynamic yet stable unconditional anchor for cross-modal generation, significantly improving temporal alignment, multi-condition coordination, and training-inference consistency. Experiments show that the method achieves state-of-the-art performance across multiple metrics, substantially outperforming existing approaches while reducing computational overhead.
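The temporally aligned RoPE idea can be illustrated with a minimal NumPy sketch. This is not the authors' code; the latent frame rates (8 video tokens/s, 25 audio tokens/s), chunk length, and all function names are hypothetical. The key point is that both streams express token positions in seconds on a shared time axis before RoPE is applied, and the same axis drives a simple temporal partitioning:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0):
    # Standard 1D RoPE: one rotation frequency per pair of channels.
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return positions[:, None] * freqs[None, :]      # (T, dim/2)

def apply_rope(x, angles):
    # Rotate channel pairs (even, odd) of x by the given angles.
    # Output keeps rotated halves concatenated rather than re-interleaved.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def aligned_positions(n_tokens, tokens_per_second):
    # Express token indices in seconds so both streams share one time axis.
    return np.arange(n_tokens) / tokens_per_second

def temporal_chunks(positions, chunk_seconds):
    # Partition tokens into temporal chunks; cross-modal attention can then
    # be restricted to tokens whose chunk ids match across modalities.
    return (positions // chunk_seconds).astype(int)

# Hypothetical latent rates: 8 video tokens/s vs. 25 audio tokens/s, 2 s clip.
video_pos = aligned_positions(16, 8.0)
audio_pos = aligned_positions(50, 25.0)

dim = 64
rng = np.random.default_rng(0)
x_v = rng.standard_normal((16, dim))
x_a = rng.standard_normal((50, dim))
v = apply_rope(x_v, rope_angles(video_pos, dim))
a = apply_rope(x_a, rope_angles(audio_pos, dim))

video_chunks = temporal_chunks(video_pos, 0.5)
audio_chunks = temporal_chunks(audio_pos, 0.5)
```

Because positions are in seconds, the video token at 1.0 s and the audio token at 1.0 s receive identical rotations, so their attention scores depend on true temporal offset rather than raw token index.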
📝 Abstract
Joint audio-video generation based on dual-stream Transformer architectures has become the dominant paradigm in current research. By combining pre-trained video and audio diffusion models with a cross-modal interaction attention module, such methods can generate high-quality, temporally synchronized audio-video content from minimal training data. In this paper, we first revisit the dual-stream Transformer paradigm and analyze its limitations: model-manifold shifts caused by the gating mechanism that controls cross-modal interactions, biases in multi-modal background regions introduced by cross-modal attention, inconsistencies in multi-modal classifier-free guidance (CFG) between training and inference, and conflicts among multiple conditions. To alleviate these issues, we propose Cross-Modal Context Learning (CCL), equipped with several carefully designed modules. Temporally Aligned RoPE and Partitioning (TARP) strengthens the temporal alignment between audio and video latent representations. The Learnable Context Tokens (LCT) and Dynamic Context Routing (DCR) in the Cross-Modal Context Attention (CCA) module provide stable unconditional anchors for cross-modal information while routing dynamically across training tasks, improving both convergence speed and generation quality. At inference, Unconditional Context Guidance (UCG) leverages the unconditional support provided by LCT to enable different forms of CFG, improving training-inference consistency and further mitigating condition conflicts. In comprehensive evaluations, CCL achieves state-of-the-art performance compared with recent academic methods while requiring substantially fewer resources.
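The role of learnable context tokens as a stable unconditional anchor can be sketched as follows. This is a toy illustration, not the paper's implementation: the shapes, the number of context tokens, and the guidance scale are assumptions, and real models would project keys/values and use multi-head attention. The idea shown is that the unconditional branch attends to learned context tokens instead of an empty (or zeroed) condition, so the same anchor is available in training and at inference for CFG:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # Single-head scaled dot-product attention (no projections, for clarity).
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
d = 32
video_q    = rng.standard_normal((10, d))  # queries from the video stream
audio_kv   = rng.standard_normal((25, d))  # keys/values from the audio stream
ctx_tokens = rng.standard_normal((4, d))   # learnable context tokens (LCT),
                                           # trained parameters in the real model

# Conditional pass: attend to the other modality plus the context tokens.
kv_cond = np.concatenate([audio_kv, ctx_tokens], axis=0)
out_cond = cross_attention(video_q, kv_cond, kv_cond)

# Unconditional pass: the context tokens alone act as the anchor, so
# dropping the cross-modal condition never leaves an empty attention span.
out_uncond = cross_attention(video_q, ctx_tokens, ctx_tokens)

# CFG-style guidance at inference (guidance scale w is a hypothetical value).
w = 3.0
out = out_uncond + w * (out_cond - out_uncond)
```

With `w = 1.0` this reduces to the conditional output, and `w = 0.0` falls back to the anchor-only branch, which is exactly the branch seen during condition-dropout training.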