MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing joint audio-video generation methods struggle to achieve cross-modal alignment and fine-grained controllability. To address this, the work proposes MMControl, a framework that enables joint conditioning on multiple modalities, including reference images, reference audio, depth maps, and pose sequences. Built on a Diffusion Transformer architecture, MMControl introduces a dual-stream conditional injection mechanism with bypass branches and a modality-specific guidance scaling strategy, which lets users adjust each modality’s influence dynamically at inference time. Experiments show that MMControl achieves composable, consistent control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.
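As a rough illustration of what injecting conditions through bypass branches could look like, the toy sketch below adds gated condition features as residuals to two separate hidden states, one per stream. It is not the paper's implementation: the function names, the per-stream gates, and the routing of visual conditions to the video stream and acoustic conditions to the audio stream are all assumptions made for illustration, and plain Python lists stand in for tensors.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def bypass_residual(hidden: Vector, cond_features: List[Vector], gate: float) -> Vector:
    """Add the gated sum of condition features to one stream's hidden state.

    Hypothetical helper: a real bypass branch would first encode each
    condition with learned layers; here the features are taken as given.
    """
    out = list(hidden)
    for feat in cond_features:
        for i, value in enumerate(feat):
            out[i] += gate * value
    return out


def dual_stream_inject(
    video_hidden: Vector,
    audio_hidden: Vector,
    visual_conds: Dict[str, Vector],
    acoustic_conds: Dict[str, Vector],
    video_gate: float = 1.0,
    audio_gate: float = 1.0,
) -> Tuple[Vector, Vector]:
    """Inject visual conditions into the video stream and acoustic
    conditions into the audio stream (assumed routing, not from the paper)."""
    video_out = bypass_residual(video_hidden, list(visual_conds.values()), video_gate)
    audio_out = bypass_residual(audio_hidden, list(acoustic_conds.values()), audio_gate)
    return video_out, audio_out


video, audio = dual_stream_inject(
    video_hidden=[1.0, 2.0],
    audio_hidden=[0.5],
    visual_conds={"depth": [0.5, 0.25], "pose": [0.25, 0.5]},
    acoustic_conds={"ref_audio": [1.0]},
    audio_gate=0.5,
)
```

Setting a gate to zero leaves that stream's hidden state unchanged, which is one simple way a bypass design can leave the base model's behavior intact when no condition is supplied.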

📝 Abstract

Recent advances in Diffusion Transformers (DiTs) have enabled high-quality joint audio-video generation, producing videos with synchronized audio within a single model. However, existing controllable generation frameworks are typically restricted to video-only control, which limits overall controllability and often leads to suboptimal cross-modal alignment. To bridge this gap, we present MMControl, which enables users to perform Multi-Modal Control in joint audio-video generation. MMControl introduces a dual-stream conditional injection mechanism that incorporates both visual and acoustic control signals, including reference images, reference audio, depth maps, and pose sequences, into a joint generation process. These conditions are injected through bypass branches into a joint audio-video Diffusion Transformer, enabling the model to simultaneously generate identity-consistent video and timbre-consistent audio under structural constraints. Furthermore, we introduce modality-specific guidance scaling, which allows users to independently and dynamically adjust the influence strength of each visual and acoustic condition at inference time. Extensive experiments demonstrate that MMControl achieves fine-grained, composable control over character identity, voice timbre, body pose, and scene layout in joint audio-video generation.
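The abstract does not give the formula behind modality-specific guidance scaling. One common way to give each condition its own strength is a composed form of classifier-free guidance, in which every condition contributes its own offset from the unconditional prediction, weighted by its own scale. The sketch below assumes that form; the function name, the condition names, and the scale values are hypothetical, and plain Python lists stand in for the model's noise predictions.

```python
from typing import Dict, List

Vector = List[float]


def combine_guidance(
    uncond: Vector,
    cond_preds: Dict[str, Vector],
    scales: Dict[str, float],
) -> Vector:
    """Compose per-condition guidance (assumed form, not from the paper):

        out = uncond + sum_i  w_i * (cond_i - uncond)

    A condition with no entry in `scales` gets weight 0 and is ignored.
    """
    out = list(uncond)
    for name, pred in cond_preds.items():
        weight = scales.get(name, 0.0)
        for i, (c, u) in enumerate(zip(pred, uncond)):
            out[i] += weight * (c - u)
    return out


# Toy predictions: one unconditional, one per condition.
uncond = [0.0, 1.0]
cond_preds = {
    "ref_image": [1.0, 1.0],
    "ref_audio": [0.0, 3.0],
}

# Emphasize identity from the reference image, soften the timbre reference.
guided = combine_guidance(uncond, cond_preds, {"ref_image": 2.0, "ref_audio": 0.5})
```

Because each scale multiplies only its own offset, raising one scale strengthens that condition without changing the weight of the others, which matches the independent, inference-time adjustment the abstract describes.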

Problem

Research questions and friction points this paper is trying to address.

multi-modal control

audio-video generation

cross-modal alignment

controllable generation

joint generation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Modal Control

Diffusion Transformer

Joint Audio-Video Generation