In-Context Audio Control of Video Diffusion Transformers

📅 2025-12-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the limitation of existing video diffusion Transformers, which neglect audio–temporal synchronization signals. We propose the first speech-driven video generation framework that directly injects raw audio waveforms as temporal conditioning into a unified architecture. Our core innovation is Masked 3D Attention—a novel 3D self-attention mechanism imposing explicit time-alignment constraints to mitigate training instability in cross-modal spatiotemporal modeling. We further integrate multi-stage audio feature alignment and diffusion distillation. Compared to baseline methods using cross-attention or 2D/3D self-attention for audio injection—and without explicit lip-motion loss or post-processing—our approach achieves significant improvements: 38% lower Lip Synchronization Error (LSE) and 22% lower Fréchet Video Distance (FVD), while supporting reference-image guidance and arbitrary speaker personalization.

Technology Category

Application Category

📝 Abstract
Recent advancements in video generation have seen a shift towards unified, transformer-based foundation models that can handle multiple conditional inputs in-context. However, these models have primarily focused on modalities like text, images, and depth maps, while strictly time-synchronous signals like audio have been underexplored. This paper introduces In-Context Audio Control of video diffusion transformers (ICAC), a framework that investigates the integration of audio signals for speech-driven video generation within a unified full-attention architecture, akin to FullDiT. We systematically explore three distinct mechanisms for injecting audio conditions: standard cross-attention, 2D self-attention, and unified 3D self-attention. Our findings reveal that while 3D attention offers the highest potential for capturing spatio-temporal audio-visual correlations, it presents significant training challenges. To overcome this, we propose a Masked 3D Attention mechanism that constrains the attention pattern to enforce temporal alignment, enabling stable training and superior performance. Our experiments demonstrate that this approach achieves strong lip synchronization and video quality, conditioned on an audio stream and reference images.
Problem

Research questions and friction points this paper is trying to address.

Integrates audio signals for speech-driven video generation
Explores mechanisms for injecting audio conditions into transformers
Proposes Masked 3D Attention to ensure temporal alignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified transformer integrates audio for video generation
Masked 3D Attention ensures temporal alignment and stability
Three audio injection mechanisms explored for synchronization
🔎 Similar Papers
No similar papers found.