MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

📅 2026-04-21

🤖 AI Summary
Existing joint audio-video generation methods struggle to achieve cross-modal alignment and fine-grained controllability. To address this, this work proposes MMControl, a framework that, for the first time, enables joint conditioning on multiple modalities, including reference images, reference audio, depth maps, and pose sequences. Built on a Diffusion Transformer architecture, MMControl introduces a dual-stream conditional injection mechanism with bypass branches and a modality-specific guidance scaling strategy, which lets users adjust each modality's influence dynamically during inference. Experiments demonstrate that MMControl achieves composable, consistent control over character identity, voice timbre, body pose, and scene layout, improving both controllability and cross-modal alignment in audio-video synthesis.

📝 Abstract
Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control, which limits overall controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism that incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.
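
To make the idea of modality-specific guidance scaling concrete, here is a minimal sketch. The abstract does not give the exact formula, so this assumes a common multi-condition extension of classifier-free guidance, in which each condition contributes its own scaled offset from the unconditional prediction. The function name combine_guidance, the condition keys, and the plain-list representation of model outputs are all hypothetical and chosen for illustration only.

```python
from typing import Dict, List

Vector = List[float]


def combine_guidance(
    uncond: Vector,
    cond: Dict[str, Vector],
    scales: Dict[str, float],
) -> Vector:
    """Combine per-condition predictions with independent guidance scales.

    ASSUMPTION: this is a generic multi-condition classifier-free guidance
    rule, not necessarily the formulation used by MMControl:

        out = uncond + sum_m scales[m] * (cond[m] - uncond)

    A condition with no entry in `scales` gets weight 0 and is ignored.
    """
    out = list(uncond)
    for name, pred in cond.items():
        w = scales.get(name, 0.0)
        for i, (p, u) in enumerate(zip(pred, uncond)):
            out[i] += w * (p - u)
    return out


# Toy 2-dimensional "noise predictions" (hypothetical values).
uncond = [0.0, 0.0]
cond = {
    "pose": [1.0, 0.0],       # prediction conditioned on a pose sequence
    "ref_audio": [0.0, 2.0],  # prediction conditioned on reference audio
}

# Strong pose control, weak timbre control.
guided = combine_guidance(uncond, cond, {"pose": 2.0, "ref_audio": 0.5})
```

In this toy setting, raising one scale changes only that condition's contribution, which is the property that would let a user trade off, say, pose adherence against timbre similarity at inference time without retraining.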

Problem

Research questions and friction points this paper is trying to address.

multi-modal control
audio-video generation
cross-modal alignment
controllable generation
joint generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Modal Control
Diffusion Transformer
Joint Audio-Video Generation
Conditional Injection
Modality-Specific Guidance