MOVA: Towards Scalable and Synchronized Video-Audio Generation

πŸ“… 2026-02-09
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This work addresses the limitations of existing video generation models, which often neglect audio and rely on cascaded pipelines, leading to high computational costs, error propagation, and audio-visual desynchronization. To overcome these challenges, we propose MOVAβ€”the first open-source, end-to-end image-and-text-to-video-and-audio (IT2VA) generation model. Built upon a 32B-parameter Mixture-of-Experts architecture (with 18B activated per forward pass), MOVA supports LoRA-based fine-tuning, efficient inference, and prompt enhancement. It simultaneously generates semantically aligned, high-fidelity video and audio, including lip-synced speech, contextually appropriate sound effects, and background music. By releasing the model weights and a complete toolchain, this work aims to advance research in joint audio-visual synthesis.

πŸ“ Abstract
Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 demonstrate the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture with a total of 32B parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
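To illustrate how an MoE model like MOVA can hold 32B parameters while activating only 18B per forward pass, here is a minimal sketch of top-k expert routing. This is an assumption-laden toy, not MOVA's actual design: the expert count, the linear gate, and k=2 are all illustrative placeholders, and the real model's routing details are not specified in the abstract.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, gate_weights, k=2):
    """Route a token to the top-k experts by gate score and mix their outputs.

    Only k of len(experts) expert networks run per token, which is how an
    MoE model can store many parameters while activating only a fraction.
    """
    # Gate scores: one linear projection per expert (illustrative gate).
    scores = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(scores)
    # Select the k highest-probability experts.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    out = [0.0] * len(token)
    for i in top:
        expert_out = experts[i](token)   # run only the selected experts
        weight = probs[i] / norm         # renormalize over the top-k
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Toy usage: 4 tiny "experts", each a simple elementwise transform.
experts = [
    lambda t: [x + 1 for x in t],
    lambda t: [x * 2 for x in t],
    lambda t: [-x for x in t],
    lambda t: [x * 0.5 for x in t],
]
gate = [[0.1, 0.3], [0.9, 0.2], [0.0, 0.1], [0.4, 0.4]]
y = moe_forward([1.0, 2.0], experts, gate, k=2)
```

With k=2 of 4 experts active, only half the expert parameters participate in each forward pass; scaling the same pattern up is what lets a 32B-parameter model activate roughly 18B per token.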
Problem

Research questions and friction points this paper is trying to address.

video-audio generation
multimodal modeling
synchronization
scalability
open-source
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
synchronized audio-visual generation
IT2VA
open-source multimodal model
efficient inference
SII-OpenMOSS Team
Donghua Yu
Mingshu Chen
Qi Chen
Qi Luo
Qianyi Wu (Monash University)
Qinyuan Cheng
Ruixiao Li
Tianyi Liang (East China Normal University, Shanghai AI Lab, Shanghai Innovation Institute)
Wenbo Zhang
Wenming Tu
Xiangyu Peng
Yang Gao
Yanru Huo
Ying Zhu
Yinze Luo
Yiyang Zhang
Yuerong Song
Zhe Xu
Zhiyu Zhang (Carnegie Mellon University)
Chenchen Yang
Cheng Chang
Chushu Zhou
Hanfu Chen