MM-Sonate: Multimodal Controllable Audio-Video Generation with Zero-Shot Voice Cloning

📅 2026-01-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
Existing unified audio-visual generation models face limitations in fine-grained acoustic control, speaker identity preservation, temporal alignment, and zero-shot voice cloning. To address these challenges, this work proposes MM-Sonate, a flow-matching-based multimodal generative framework that achieves strict audio-visual synchronization through joint instruction-phoneme encoding and disentangles speaker identity from linguistic content via a timbre injection mechanism. Furthermore, a negative conditional generation strategy grounded in a noise prior is introduced to enhance acoustic fidelity. Experimental results demonstrate that MM-Sonate significantly outperforms current methods in lip-sync accuracy, speech intelligibility, and zero-shot cloning fidelity, establishing a new state of the art on joint audio-visual generation benchmarks.

📝 Abstract
Joint audio-video generation aims to synthesize synchronized multisensory content, yet current unified models struggle with fine-grained acoustic control, particularly for identity-preserving speech. Existing approaches either suffer from temporal misalignment due to cascaded generation or lack the capability to perform zero-shot voice cloning within a joint synthesis framework. In this work, we present MM-Sonate, a multimodal flow-matching framework that unifies controllable audio-video joint generation with zero-shot voice cloning capabilities. Unlike prior work that relies on coarse semantic descriptions, MM-Sonate utilizes a unified instruction-phoneme input to enforce strict linguistic and temporal alignment. To enable zero-shot voice cloning, we introduce a timbre injection mechanism that effectively decouples speaker identity from linguistic content. Furthermore, addressing the limitations of standard classifier-free guidance in multimodal settings, we propose a noise-based negative conditioning strategy that utilizes natural noise priors to significantly enhance acoustic fidelity. Empirical evaluations demonstrate that MM-Sonate establishes new state-of-the-art performance on joint generation benchmarks, significantly outperforming baselines in lip synchronization and speech intelligibility, while achieving voice cloning fidelity comparable to specialized Text-to-Speech systems.
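The "noise-based negative conditioning" mentioned in the abstract modifies the negative branch of classifier-free guidance: instead of extrapolating away from an unconditional prediction, the model is pushed away from a noise-prior condition. A minimal, generic sketch of that guidance arithmetic is below; the `model` function is a toy stand-in (not the paper's network), and the condition vectors are hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def model(x, cond):
    # Toy stand-in for a learned flow/denoising network, so the
    # guidance arithmetic below is runnable end to end.
    return 0.9 * x + cond

def guided_velocity(x, pos_cond, neg_cond, scale=3.0):
    """CFG-style guidance with an explicit negative condition.

    Standard classifier-free guidance uses an *unconditional* branch as
    the negative; a noise-prior variant instead conditions the negative
    branch on a noise embedding, steering generation away from "noisy
    audio" rather than away from "no condition".
    """
    v_pos = model(x, pos_cond)   # positive branch: full condition
    v_neg = model(x, neg_cond)   # negative branch: noise-prior condition
    return v_neg + scale * (v_pos - v_neg)

x = rng.standard_normal(4)
pos = np.ones(4)        # hypothetical "instruction + phoneme" embedding
neg = np.full(4, 0.1)   # hypothetical natural-noise-prior embedding
v = guided_velocity(x, pos, neg)
```

With `scale=1.0` the expression collapses to the plain positive-branch prediction; larger scales amplify the contrast between the positive condition and the noise prior.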
Problem

Research questions and friction points this paper is trying to address.

- joint audio-video generation
- zero-shot voice cloning
- temporal alignment
- speaker identity preservation
- multimodal synthesis

Innovation

Methods, ideas, or system contributions that make the work stand out.

- multimodal generation
- zero-shot voice cloning
- flow matching
- timbre injection
- noise-based conditioning
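"Flow matching", listed among the contributions above, trains a network to regress a velocity field along a path between noise and data. The sketch below shows the standard linear (rectified) conditional flow-matching setup in the abstract sense only; it is a generic illustration, not the paper's parameterization, and the zero "network output" is a placeholder.

```python
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_target(x0, x1, t):
    """Linear conditional flow: x_t = (1 - t) * x0 + t * x1.

    The regression target for the velocity network is the constant
    velocity x1 - x0 along this straight-line path.
    """
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

# Toy example: x0 is a noise sample, x1 a "data" sample (e.g. a latent).
x0 = rng.standard_normal(8)
x1 = rng.standard_normal(8)
t = 0.3
xt, v_target = flow_matching_target(x0, x1, t)

# A network v_theta(xt, t, cond) would be trained with an MSE loss
# against v_target; here a zero placeholder stands in for its output.
loss = np.mean((np.zeros_like(xt) - v_target) ** 2)
```

At sampling time, the learned velocity field is integrated from noise to data with an ODE solver, with the conditioning inputs (instruction-phoneme sequence, timbre embedding) supplied at every step.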
Authors

- Chunyu Qiang, Kuaishou Technology; TJU; CASIA (Speech Synthesis)
- Jun Wang, Kling Team, Kuaishou Technology
- Xiaopeng Wang, Institute of Automation, Chinese Academy of Sciences (Fake Audio Detection, Text-to-Speech, Speech Large Model)
- Kang Yin, Kling Team, Kuaishou Technology
- Yuxin Guo, Kling Team, Kuaishou Technology
- Xijuan Zeng, Kling Team, Kuaishou Technology
- Nan Li, Kling Team, Kuaishou Technology
- Zihan Li, University of Washington (Foundation Model, AI for Healthcare, Multimodal Learning)
- Yuzhe Liang, Shanghai Jiao Tong University (Deep Learning, Multimodal Learning)
- Ziyu Zhang, Kling Team, Kuaishou Technology
- Teng Ma, Kling Team, Kuaishou Technology
- Yushen Chen, Shanghai Jiao Tong University (Speech and Language Processing)
- Zhongliang Liu, Kling Team, Kuaishou Technology
- Feng Deng, Kling Team, Kuaishou Technology
- Chen Zhang, Kling Team, Kuaishou Technology
- Pengfei Wan, Head of Kling Video Generation Models, Kuaishou Technology (Generative Models, Computer Vision, Multimodal AI, Computer Graphics)