JoVA: Unified Multimodal Learning for Joint Video-Audio Generation

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing joint video-audio generation methods suffer from two key limitations: poor lip-sync accuracy in generated speech and reliance on explicit alignment modules, which compromise the architectural elegance of Transformers. This paper proposes an end-to-end joint generation framework featuring a novel cross-modal joint self-attention mechanism, enabling direct interaction between video and audio tokens at every Transformer layer—eliminating the need for explicit alignment or dedicated modality fusion modules. Additionally, we introduce a landmark-guided mouth-region loss to enhance lip-sync precision without increasing parameter count. By unifying multimodal tokenization and training within a single Transformer, our approach significantly improves lip-sync accuracy, speech naturalness, and audiovisual fidelity. It outperforms or matches state-of-the-art unified and audio-driven methods across multiple quantitative metrics, demonstrating that high-fidelity multimodal generation is achievable with a streamlined, alignment-free architecture.

📝 Abstract
In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which add architectural complexity and erode the simplicity of the original transformer design. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without additional alignment modules. Furthermore, to achieve high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which strengthens supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. These results establish JoVA as an elegant framework for high-quality multimodal generation.
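
To make the core mechanism concrete, here is a minimal PyTorch-style sketch of one joint self-attention block, assuming video and audio have already been tokenized into a shared embedding dimension. The dimensions, the learned modality embeddings, and the block layout are illustrative assumptions rather than the paper's exact configuration; the point is that the two token sequences are simply concatenated and attended over together, so no separate fusion or alignment module is needed.

```python
import torch
import torch.nn as nn

class JointSelfAttentionBlock(nn.Module):
    """One transformer block that attends over the concatenated video and
    audio token sequences, so cross-modal interaction happens in every
    layer without any dedicated fusion module. Sizes are illustrative."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Learned embeddings marking which tokens are video (0) vs. audio (1);
        # an assumption for this sketch, not necessarily the paper's design.
        self.modality_embed = nn.Embedding(2, dim)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # video_tokens: (B, Nv, D); audio_tokens: (B, Na, D)
        nv, na = video_tokens.shape[1], audio_tokens.shape[1]
        ids = torch.cat([
            torch.zeros(nv, dtype=torch.long, device=video_tokens.device),
            torch.ones(na, dtype=torch.long, device=audio_tokens.device),
        ])
        # Concatenate along the sequence axis and tag each token's modality.
        x = torch.cat([video_tokens, audio_tokens], dim=1) + self.modality_embed(ids)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # joint attention
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Split back so per-modality heads can consume their own streams.
        return x[:, :nv], x[:, nv:]
```

Because every layer mixes the two streams, audio tokens can condition on mouth motion (and vice versa) throughout the network, rather than only at a late fusion point.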
Problem

Research questions and friction points this paper is trying to address.

Generating synchronized video-audio that includes human speech
Eliminating the need for explicit cross-modal alignment modules
Enhancing lip-sync via a mouth-area loss without added complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint self-attention enables direct cross-modal interaction at every layer
Mouth-area loss strengthens lip-speech synchronization (see the sketch after this list)
A single streamlined transformer preserves architectural simplicity
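
The mouth-area loss can be read as a masked reconstruction penalty on top of the usual frame loss. Below is a hedged sketch assuming the facial keypoint detector has already been reduced to per-frame mouth bounding boxes; the `weight` hyperparameter, the MSE base term, and the box representation are illustrative assumptions, not the paper's reported setup.

```python
import torch
import torch.nn.functional as F

def mouth_area_loss(pred_frames: torch.Tensor,
                    target_frames: torch.Tensor,
                    mouth_boxes: torch.Tensor,
                    weight: float = 2.0) -> torch.Tensor:
    """Reconstruction loss with extra supervision on the mouth region.

    pred_frames, target_frames: (B, T, C, H, W) video tensors.
    mouth_boxes: (B, T, 4) integer (x1, y1, x2, y2) boxes derived from
        facial keypoints (assumed precomputed; the detector is not shown).
    weight: illustrative weighting of the mouth term (an assumption).
    """
    # Global reconstruction term over the whole frame.
    base = F.mse_loss(pred_frames, target_frames)

    # Binary mask that is 1 inside the mouth box of every frame.
    mask = torch.zeros_like(pred_frames[:, :, :1])  # (B, T, 1, H, W)
    for b in range(mouth_boxes.shape[0]):
        for t in range(mouth_boxes.shape[1]):
            x1, y1, x2, y2 = mouth_boxes[b, t].tolist()
            mask[b, t, :, y1:y2, x1:x2] = 1.0

    # Mean squared error restricted to the masked mouth pixels.
    diff2 = (pred_frames - target_frames) ** 2
    denom = mask.expand_as(diff2).sum().clamp(min=1.0)
    mouth = (mask * diff2).sum() / denom
    return base + weight * mouth
```

Since the extra term only reweights the training objective, it adds no parameters or inference cost, consistent with the claim of improving lip-sync precision without increasing parameter count.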
Authors

Xiaohu Huang, The University of Hong Kong (computer vision, video analysis)
Hao Zhou, ByteDance
Qiangpeng Yang, ByteDance
Shilei Wen, ByteDance (computer vision, machine learning)
Kai Han, The University of Hong Kong