🤖 AI Summary
This study investigates whether audio signals can serve as privileged information to enhance purely video-based generation. To this end, we propose AVFullDiT—a parameter-efficient architecture that leverages pretrained text-to-video (T2V) and text-to-audio (T2A) models for joint audio-visual denoising training. Our core contribution is the first systematic empirical validation that cross-modal co-training enables the model to capture audio-visual causal relationships, thereby imposing physics-aware consistency regularization on video dynamics. Experiments demonstrate that AVFullDiT significantly outperforms unimodal baselines in challenging scenarios involving large motions and object interactions, achieving consistent improvements across multiple video generation metrics—including FVD, LPIPS, and motion consistency scores. These results substantiate the efficacy and generalizability of audio-augmented visual generation, highlighting its potential for improving physical plausibility and temporal coherence in diffusion-based video synthesis.
📝 Abstract
Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large motions and object-contact interactions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\times$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.
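The joint denoising objective described above can be sketched as follows. This is a minimal, hypothetical NumPy illustration (not the paper's actual implementation or API): both modalities are noised at a shared diffusion timestep, and the training loss sums per-modality noise-prediction errors, so the audio term acts as the auxiliary signal regularizing the video branch.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, eps, alpha_bar):
    """Standard DDPM forward process: x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Toy latents standing in for video (frames x dim) and audio (steps x dim).
video0 = rng.standard_normal((4, 8))
audio0 = rng.standard_normal((16, 2))

# A shared timestep implies a shared noise level across modalities.
alpha_bar = 0.5
eps_v = rng.standard_normal(video0.shape)
eps_a = rng.standard_normal(audio0.shape)
video_t = add_noise(video0, eps_v, alpha_bar)
audio_t = add_noise(audio0, eps_a, alpha_bar)

def joint_denoising_loss(pred_eps_v, pred_eps_a, eps_v, eps_a, w_audio=1.0):
    """Sum of per-modality epsilon-prediction MSEs; w_audio weights the
    auxiliary (privileged) audio objective (illustrative choice)."""
    loss_v = np.mean((pred_eps_v - eps_v) ** 2)
    loss_a = np.mean((pred_eps_a - eps_a) ** 2)
    return loss_v + w_audio * loss_a

# A perfect denoiser drives the joint loss to zero.
print(joint_denoising_loss(eps_v, eps_a, eps_v, eps_a))  # -> 0.0
```

In this sketch, dropping the audio term (`w_audio=0`) recovers the T2V-only baseline objective, which is exactly the controlled comparison the abstract describes.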