3MDiT: Unified Tri-Modal Diffusion Transformer for Text-Driven Synchronized Audio-Video Generation

πŸ“… 2025-11-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Current text-to-video (T2V) models predominantly generate silent videos; joint audiovisual generation typically relies on cascaded pipelines or dual-branch architectures, leading to accumulated cross-modal errors, poor temporal synchronization, and limited reusability of pretrained T2V backbones. To address these limitations, we propose 3MDiT, a unified tri-modal diffusion Transformer framework for synchronized audiovisual generation conditioned on text. Our key contributions are: (1) an omni-block enabling dynamic, feature-level interaction among audio, video, and text; (2) an isomorphic audio branch coupled with dynamic text reweighting to ensure temporal equivariance and semantic co-evolution across modalities; and (3) an orthogonal adaptation design enabling zero-modification integration of pretrained T2V models. Experiments demonstrate that 3MDiT significantly outperforms existing methods in audiovisual synchronization, tri-modal alignment, and generation fidelity.

πŸ“ Abstract
Text-to-video (T2V) diffusion models have recently achieved impressive visual quality, yet most systems still generate silent clips and treat audio as a secondary concern. Existing audio-video generation pipelines typically decompose the task into cascaded stages, which accumulate errors across modalities and are trained under separate objectives. Recent joint audio-video generators alleviate this issue but often rely on dual-tower architectures with ad-hoc cross-modal bridges and static, single-shot text conditioning, making it difficult to both reuse T2V backbones and to reason about how audio, video and language interact over time. To address these challenges, we propose 3MDiT, a unified tri-modal diffusion transformer for text-driven synchronized audio-video generation. Our framework models video, audio and text as jointly evolving streams: an isomorphic audio branch mirrors a T2V backbone, tri-modal omni-blocks perform feature-level fusion across the three modalities, and an optional dynamic text conditioning mechanism updates the text representation as audio and video evidence co-evolve. The design supports two regimes: training from scratch on audio-video data, and orthogonally adapting a pretrained T2V model without modifying its backbone. Experiments show that our approach generates high-quality videos and realistic audio while consistently improving audio-video synchronization and tri-modal alignment across a range of quantitative metrics.
Problem

Research questions and friction points this paper is trying to address.

Generating synchronized audio and video from a single text prompt
Unifying audio, video, and text within a single model
Improving cross-modal alignment and synchronization quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified tri-modal diffusion transformer for synchronized audio-video generation
Tri-modal omni-blocks enable feature-level fusion across modalities
Dynamic text conditioning updates the text representation as audio and video co-evolve
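The omni-block's feature-level fusion described above can be sketched as joint self-attention over the concatenated video, audio, and text token streams, so every modality attends to every other in one pass. This is a minimal illustration with random weights; the function name `omni_block`, the single-head attention, and the tensor shapes are assumptions, not the paper's actual parameterization.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def omni_block(video, audio, text, rng):
    """Sketch of tri-modal fusion: joint self-attention over the
    concatenated (video, audio, text) tokens, with a residual update.
    Illustrative only -- not the authors' implementation."""
    d = video.shape[-1]
    tokens = np.concatenate([video, audio, text], axis=0)  # (Nv+Na+Nt, d)
    # Random projection matrices stand in for learned Q/K/V weights
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))        # every token attends to all modalities
    fused = tokens + attn @ v                   # residual feature-level fusion
    # Split the fused sequence back into per-modality streams;
    # the returned text tokens model the "dynamic text conditioning" idea
    nv, na = video.shape[0], audio.shape[0]
    return fused[:nv], fused[nv:nv + na], fused[nv + na:]
```

Because the text tokens are part of the attended sequence, they are updated by audio and video evidence on every pass, which is the intuition behind the dynamic text conditioning bullet.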
πŸ”Ž Similar Papers
No similar papers found.
Yaoru Li
Zhejiang University, Huawei Technologies
LLM Agents
Heyu Si
Zhejiang University
Federico Landi
Huawei Technologies
Computer Vision, Deep Learning
Pilar Oplustil Gallegos
Huawei
Ioannis Koutsoumpas
Huawei
O. Ricardo Cortez Vazquez
Huawei
Ruiju Fu
Huawei
Qi Guo
Huawei
Xin Jin
Huawei
Shunyu Liu
Nanyang Technological University
Multi-Agent Learning, Reinforcement Learning, Large Language Models, Power System Control
Mingli Song
Zhejiang University