JoVA: Unified Multimodal Learning for Joint Video-Audio Generation

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing joint video-audio generation methods suffer from two key limitations: poor lip-sync accuracy in generated speech and reliance on explicit alignment modules, which compromise the architectural elegance of Transformers. This paper proposes an end-to-end joint generation framework featuring a novel cross-modal joint self-attention mechanism, enabling direct interaction between video and audio tokens at every Transformer layer—eliminating the need for explicit alignment or dedicated modality fusion modules. Additionally, we introduce a landmark-guided mouth-region loss to enhance lip-sync precision without increasing parameter count. By unifying multimodal tokenization and training within a single Transformer, our approach significantly improves lip-sync accuracy, speech naturalness, and audiovisual fidelity. It outperforms or matches state-of-the-art unified and audio-driven methods across multiple quantitative metrics, demonstrating that high-fidelity multimodal generation is achievable with a streamlined, alignment-free architecture.

📝 Abstract
In this paper, we present JoVA, a unified framework for joint video-audio generation. Despite recent encouraging advances, existing methods face two critical limitations. First, most existing approaches can only generate ambient sounds and lack the capability to produce human speech synchronized with lip movements. Second, recent attempts at unified human video-audio generation typically rely on explicit fusion or modality-specific alignment modules, which add architectural complexity and erode the simplicity of the original transformer design. To address these issues, JoVA employs joint self-attention across video and audio tokens within each transformer layer, enabling direct and efficient cross-modal interaction without additional alignment modules. Furthermore, to achieve high-quality lip-speech synchronization, we introduce a simple yet effective mouth-area loss based on facial keypoint detection, which strengthens supervision on the critical mouth region during training without compromising architectural simplicity. Extensive experiments on benchmarks demonstrate that JoVA outperforms or is competitive with both unified and audio-driven state-of-the-art methods in lip-sync accuracy, speech quality, and overall video-audio generation fidelity. These results establish JoVA as an elegant framework for high-quality multimodal generation.
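
To make the core mechanism concrete, here is a minimal PyTorch-style sketch of one joint self-attention block, assuming video and audio have already been tokenized into a shared embedding dimension. The dimensions, the learned modality embeddings, and the block layout are illustrative assumptions rather than the paper's exact configuration; the point is that the two token sequences are simply concatenated and attended over together, so no separate fusion or alignment module is needed.

```python
import torch
import torch.nn as nn

class JointSelfAttentionBlock(nn.Module):
    """One transformer block that attends over the concatenated video and
    audio token sequences, so cross-modal interaction happens in every
    layer without any dedicated fusion module. Sizes are illustrative."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Learned embeddings marking which tokens are video (0) vs. audio (1);
        # an assumption for this sketch, not necessarily the paper's design.
        self.modality_embed = nn.Embedding(2, dim)

    def forward(self, video_tokens: torch.Tensor, audio_tokens: torch.Tensor):
        # video_tokens: (B, Nv, D); audio_tokens: (B, Na, D)
        nv, na = video_tokens.shape[1], audio_tokens.shape[1]
        ids = torch.cat([
            torch.zeros(nv, dtype=torch.long, device=video_tokens.device),
            torch.ones(na, dtype=torch.long, device=audio_tokens.device),
        ])
        # Concatenate along the sequence axis and tag each token's modality.
        x = torch.cat([video_tokens, audio_tokens], dim=1) + self.modality_embed(ids)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # joint attention
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        # Split back so per-modality heads can consume their own streams.
        return x[:, :nv], x[:, nv:]
```

Because every layer mixes the two streams, audio tokens can condition on mouth motion (and vice versa) throughout the network, rather than only at a late fusion point.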
Problem

Research questions and friction points this paper is trying to address.

Generating synchronized video-audio that includes human speech
Eliminating the need for explicit cross-modal alignment modules
Enhancing lip-sync via a mouth-area loss without added complexity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Joint self-attention enables direct cross-modal interaction at every layer
Mouth-area loss strengthens lip-speech synchronization (see the sketch after this list)
A single streamlined transformer preserves architectural simplicity
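
The mouth-area loss can be read as a masked reconstruction penalty on top of the usual frame loss. Below is a hedged sketch assuming the facial keypoint detector has already been reduced to per-frame mouth bounding boxes; the `weight` hyperparameter, the MSE base term, and the box representation are illustrative assumptions, not the paper's reported setup.

```python
import torch
import torch.nn.functional as F

def mouth_area_loss(pred_frames: torch.Tensor,
                    target_frames: torch.Tensor,
                    mouth_boxes: torch.Tensor,
                    weight: float = 2.0) -> torch.Tensor:
    """Reconstruction loss with extra supervision on the mouth region.

    pred_frames, target_frames: (B, T, C, H, W) video tensors.
    mouth_boxes: (B, T, 4) integer (x1, y1, x2, y2) boxes derived from
        facial keypoints (assumed precomputed; the detector is not shown).
    weight: illustrative weighting of the mouth term (an assumption).
    """
    # Global reconstruction term over the whole frame.
    base = F.mse_loss(pred_frames, target_frames)

    # Binary mask that is 1 inside the mouth box of every frame.
    mask = torch.zeros_like(pred_frames[:, :, :1])  # (B, T, 1, H, W)
    for b in range(mouth_boxes.shape[0]):
        for t in range(mouth_boxes.shape[1]):
            x1, y1, x2, y2 = mouth_boxes[b, t].tolist()
            mask[b, t, :, y1:y2, x1:x2] = 1.0

    # Mean squared error restricted to the masked mouth pixels.
    diff2 = (pred_frames - target_frames) ** 2
    denom = mask.expand_as(diff2).sum().clamp(min=1.0)
    mouth = (mask * diff2).sum() / denom
    return base + weight * mouth
```

Since the extra term only reweights the training objective, it adds no parameters or inference cost, consistent with the claim of improving lip-sync precision without increasing parameter count.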
Authors

Xiaohu Huang, The University of Hong Kong (computer vision, video analysis)
Hao Zhou, ByteDance
Qiangpeng Yang, ByteDance
Shilei Wen, ByteDance (computer vision, machine learning)
Kai Han, The University of Hong Kong