MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

📅 2025-12-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing audiovisual dialogue systems are predominantly non-interactive, suffering from low speech naturalness, short temporal duration, and audio–visual asynchrony. To address these limitations, we propose a novel two-stage Conductor-Creator architecture that enables long-duration, high-fidelity, and identity-consistent multimodal audiovisual dialogue understanding and generation for the first time. The Conductor module employs a dual Diffusion Transformer (DiT) structure to jointly model cross-modal semantics and contextual continuity, while the Creator module synergistically integrates autoregressive audio generation with diffusion-based video synthesis, augmented by a novel fusion module that explicitly enforces audio–visual synchronization and speaker identity consistency. Extensive experiments demonstrate that our method significantly improves semantic coherence, audio–visual alignment accuracy, and interactive naturalness, achieving state-of-the-art performance on long-horizon dialogue generation tasks.

📝 Abstract
We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches primarily focus on non-interactive systems and are limited to producing constrained and unnatural human speech. The primary challenge of this task lies in effectively integrating understanding and generation capabilities, as well as achieving seamless multimodal audio-video fusion. To solve these problems, we propose a Conductor-Creator architecture that divides the dialogue system into two primary components. The Conductor is tasked with understanding, reasoning, and generating instructions by breaking them down into motion and speech components, thereby enabling fine-grained control over interactions. The Creator then delivers interactive responses based on these instructions. Furthermore, to address the difficulty of generating long videos with consistent identity, timbre, and tone using dual DiT structures, the Creator adopts a structure that combines autoregressive (AR) and diffusion models. The AR model is responsible for audio generation, while the diffusion model ensures high-quality video generation. Additionally, we propose a novel fusion module to enhance connections between contextually consecutive clips and modalities, enabling synchronized long-duration audio-visual content generation. Extensive experiments demonstrate that our framework can generate vivid and contextually coherent long-duration dialogue interactions and accurately interpret users' multimodal queries.
Problem

Research questions and friction points this paper is trying to address.

Integrates multimodal audio-video understanding and generation
Enables natural, long-duration dialogue interactions with fine-grained control
Ensures consistent identity and synchronization in generated content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conductor-Creator architecture for multimodal dialogue control
AR and diffusion models for synchronized audio-video generation
Novel fusion module for consistent long-duration content
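The Conductor-Creator split summarized above can be illustrated with a minimal toy pipeline. This is purely an illustrative sketch, not the paper's implementation: every class and method name here is invented, and string stubs stand in for the dual-DiT understanding stage, the AR audio generator, the diffusion video generator, and the fusion module.

```python
from dataclasses import dataclass

# Hypothetical sketch of the two-stage Conductor-Creator architecture.
# All names are illustrative; the actual system uses dual DiT models for
# understanding, an AR model for audio, and a diffusion model for video.

@dataclass
class Instruction:
    motion: str   # fine-grained motion directive for the video branch
    speech: str   # speech directive for the audio branch

class Conductor:
    """Understands the user's query and emits instructions, broken down
    into motion and speech components for fine-grained control."""
    def plan(self, user_query: str) -> Instruction:
        return Instruction(motion=f"react to: {user_query}",
                           speech=f"reply to: {user_query}")

class Creator:
    """Turns instructions into a response clip. In the paper, the AR
    branch generates audio and the diffusion branch generates video;
    here both are string stubs."""
    def ar_audio(self, speech: str) -> str:
        return f"[audio<{speech}>]"

    def diffusion_video(self, motion: str) -> str:
        return f"[video<{motion}>]"

    def fuse(self, audio: str, video: str) -> str:
        # Stand-in for the fusion module, which enforces audio-visual
        # synchronization and identity consistency across clips.
        return f"{audio}+{video}"

    def respond(self, inst: Instruction) -> str:
        return self.fuse(self.ar_audio(inst.speech),
                         self.diffusion_video(inst.motion))

def dialogue_turn(user_query: str) -> str:
    """One interactive turn: Conductor plans, Creator renders."""
    inst = Conductor().plan(user_query)
    return Creator().respond(inst)
```

The point of the sketch is the division of labor: the Conductor never touches pixels or waveforms, and the Creator never interprets the user's query directly, which is how the paper separates understanding/reasoning from synthesis.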
Authors
Youxin Pang (Tsinghua University, Meituan)
Jiajun Liu (Meituan)
Lingfeng Tan (Meituan)
Yong Zhang (Meituan)
Feng Gao (Meituan)
Xiang Deng (Scale AI): Machine Learning, NLP, Knowledge Graphs, Semantic Web
Zhuoliang Kang (Meituan)
Xiaoming Wei (Meituan): computer vision, machine learning
Yebin Liu (Professor, Tsinghua University): Computer Graphics, Computational Photography, 3D Vision, Digital Humans