🤖 AI Summary
This study investigates whether audio signals can serve as privileged information to enhance purely video-based generation. To this end, we propose AVFullDiT—a parameter-efficient architecture that leverages pretrained text-to-video (T2V) and text-to-audio (T2A) models for joint audio-visual denoising training. Our core contribution is the first systematic empirical validation that cross-modal co-training enables the model to capture audio-visual causal relationships, thereby imposing physics-aware consistency regularization on video dynamics. Experiments demonstrate that AVFullDiT significantly outperforms unimodal baselines in challenging scenarios involving large motions and object interactions, achieving consistent improvements across multiple video generation metrics—including FVD, LPIPS, and motion consistency scores. These results substantiate the efficacy and generalizability of audio-augmented visual generation, highlighting its potential for improving physical plausibility and temporal coherence in diffusion-based video synthesis.
📝 Abstract
Recent audio-video generative systems suggest that coupling modalities benefits not only audio-video synchrony but also the video modality itself. We pose a fundamental question: Does audio-video joint denoising training improve video generation, even when we only care about video quality? To study this, we introduce a parameter-efficient Audio-Video Full DiT (AVFullDiT) architecture that leverages pre-trained text-to-video (T2V) and text-to-audio (T2A) modules for joint denoising. We train (i) a T2AV model with AVFullDiT and (ii) a T2V-only counterpart under identical settings. Our results provide the first systematic evidence that audio-video joint denoising can deliver more than synchrony. We observe consistent improvements on challenging subsets featuring large motions and object-contact interactions. We hypothesize that predicting audio acts as a privileged signal, encouraging the model to internalize causal relationships between visual events and their acoustic consequences (e.g., collision $\times$ impact sound), which in turn regularizes video dynamics. Our findings suggest that cross-modal co-training is a promising approach to developing stronger, more physically grounded world models. Code and dataset will be made publicly available.
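The joint denoising objective described above can be sketched as follows. This is a minimal, hypothetical NumPy illustration (not the paper's actual implementation or API): both modalities are noised at a shared diffusion timestep, and the training loss sums per-modality noise-prediction errors, so the audio term acts as the auxiliary signal regularizing the video branch.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_noise(x0, eps, alpha_bar):
    """Standard DDPM forward process: x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps."""
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps

# Toy latents standing in for video (frames x dim) and audio (steps x dim).
video0 = rng.standard_normal((4, 8))
audio0 = rng.standard_normal((16, 2))

# A shared timestep implies a shared noise level across modalities.
alpha_bar = 0.5
eps_v = rng.standard_normal(video0.shape)
eps_a = rng.standard_normal(audio0.shape)
video_t = add_noise(video0, eps_v, alpha_bar)
audio_t = add_noise(audio0, eps_a, alpha_bar)

def joint_denoising_loss(pred_eps_v, pred_eps_a, eps_v, eps_a, w_audio=1.0):
    """Sum of per-modality epsilon-prediction MSEs; w_audio weights the
    auxiliary (privileged) audio objective (illustrative choice)."""
    loss_v = np.mean((pred_eps_v - eps_v) ** 2)
    loss_a = np.mean((pred_eps_a - eps_a) ** 2)
    return loss_v + w_audio * loss_a

# A perfect denoiser drives the joint loss to zero.
print(joint_denoising_loss(eps_v, eps_a, eps_v, eps_a))  # -> 0.0
```

In this sketch, dropping the audio term (`w_audio=0`) recovers the T2V-only baseline objective, which is exactly the controlled comparison the abstract describes.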