🤖 AI Summary
Conversational systems often produce irrelevant responses and handle turn-taking poorly in noisy, multi-speaker environments. To address this, we propose the first end-to-end audio-visual fusion framework for spoken dialogue modeling. It jointly leverages acoustic signals and visual cues (including gaze direction and speech activity) to track the target speaker, produce streaming speaker-aware transcription, detect semantically aligned turn boundaries, and generate coherent responses. Our approach introduces a multi-task, multi-stage training paradigm that couples acoustic tokenization with a multimodal fusion network, jointly optimized across single-speaker, synthetic, and real-world audio-visual datasets. Experiments demonstrate significant reductions in transcription error rates, more accurate turn-switch prediction, and higher human-rated dialogue fluency. These results validate the critical role of audio-visual synergy in building robust, natural conversational systems.
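To make the fusion design concrete, below is a minimal sketch (not the authors' implementation) of how discrete acoustic tokens and time-aligned visual cue features could be fused by a shared encoder feeding three task heads: transcription, turn-boundary detection, and response generation. All module names, dimensions, and the PyTorch framing are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch of an audio-visual fusion model with multi-task heads.
# Assumes discrete acoustic tokens (e.g. from a neural audio tokenizer) and
# per-frame visual cue features (e.g. gaze / speech-activity embeddings) that
# are already time-aligned. Everything here is a hypothetical stand-in.
import torch
import torch.nn as nn


class AVFusionModel(nn.Module):
    def __init__(self, n_audio_tokens=1024, visual_dim=128, d_model=256,
                 vocab_size=5000):
        super().__init__()
        # Embed discrete acoustic tokens into the shared model dimension.
        self.audio_embed = nn.Embedding(n_audio_tokens, d_model)
        # Project per-frame visual cues into the same dimension.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Fuse the two time-aligned streams with a shared Transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        # Task heads: streaming speaker-aware transcription, turn-boundary
        # detection (hold vs. switch), and response-token prediction.
        self.asr_head = nn.Linear(d_model, vocab_size)
        self.turn_head = nn.Linear(d_model, 2)
        self.response_head = nn.Linear(d_model, vocab_size)

    def forward(self, audio_tokens, visual_feats):
        # audio_tokens: (batch, T) int64; visual_feats: (batch, T, visual_dim).
        x = self.audio_embed(audio_tokens) + self.visual_proj(visual_feats)
        h = self.fusion(x)
        return self.asr_head(h), self.turn_head(h), self.response_head(h)


# Smoke test with random inputs.
model = AVFusionModel()
audio = torch.randint(0, 1024, (2, 50))
visual = torch.randn(2, 50, 128)
asr, turn, resp = model(audio, visual)
print(asr.shape, turn.shape, resp.shape)
```

Summing the embedded streams before a shared encoder is only one plausible fusion choice; cross-attention between modalities would be an equally reasonable variant.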
📝 Abstract
Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialogue framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection, and accurate responses, resulting in natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.
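The multi-task, multi-stage training the abstract describes could look roughly like the sketch below, which reuses the hypothetical `AVFusionModel` from the earlier example. The stage ordering (monadic, then synthetic, then real), the loss weights, and the batch keys are all assumptions for illustration, not the paper's actual recipe.

```python
# Hypothetical multi-stage training loop with a weighted multi-task loss:
# train first on monadic (single-speaker) data, then synthetic multi-speaker
# mixtures, then real audio-visual dialogue. Weights and keys are invented.
import torch
import torch.nn.functional as F


def multitask_loss(asr_logits, turn_logits, resp_logits, batch,
                   w_asr=1.0, w_turn=0.5, w_resp=1.0):
    # Per-frame cross-entropy for each task head, combined with fixed weights.
    l_asr = F.cross_entropy(asr_logits.flatten(0, 1), batch["text"].flatten())
    l_turn = F.cross_entropy(turn_logits.flatten(0, 1), batch["turn"].flatten())
    l_resp = F.cross_entropy(resp_logits.flatten(0, 1), batch["resp"].flatten())
    return w_asr * l_asr + w_turn * l_turn + w_resp * l_resp


def train_stages(model, stage_loaders, lr=1e-4):
    # stage_loaders: ordered [(stage_name, iterable_of_batches), ...].
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for stage_name, loader in stage_loaders:
        for batch in loader:
            asr, turn, resp = model(batch["audio"], batch["visual"])
            loss = multitask_loss(asr, turn, resp, batch)
            opt.zero_grad()
            loss.backward()
            opt.step()


# Example: one tiny "stage" with a single random batch.
model = AVFusionModel()
batch = {
    "audio": torch.randint(0, 1024, (2, 50)),
    "visual": torch.randn(2, 50, 128),
    "text": torch.randint(0, 5000, (2, 50)),
    "turn": torch.randint(0, 2, (2, 50)),
    "resp": torch.randint(0, 5000, (2, 50)),
}
train_stages(model, [("synthetic", [batch])])
```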