AV-Dialog: Spoken Dialogue Models with Audio-Visual Input

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conversational systems often produce irrelevant responses and exhibit turn-taking disorganization in multi-speaker noisy environments. To address this, we propose the first end-to-end audio-visual fusion framework for spoken dialogue modeling, jointly leveraging acoustic signals and visual cues—including gaze direction and speech activity—to enable target speaker tracking, streaming speaker-aware transcription, semantically aligned turn boundary detection, and coherent response generation. Our approach introduces a novel multi-task, multi-stage training paradigm that integrates acoustic tokenization with a multimodal fusion network, jointly optimized across single-speaker, synthetic, and real-world audiovisual datasets. Experiments demonstrate significant reductions in transcription error rates, improved turn-switch prediction accuracy, and higher human-rated dialogue fluency scores. These results validate the critical role of audio-visual synergy in building robust and natural conversational systems.

📝 Abstract
Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialog framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection and accurate responses, resulting in a natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.
Problem

Research questions and friction points this paper is trying to address.

Addresses dialogue model failures in noisy multi-speaker environments
Develops multimodal framework using audio-visual cues for speaker tracking
Enhances turn-taking prediction and response coherence through multimodal training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses audio-visual cues for speaker tracking
Combines acoustic tokenization with multi-task training
Achieves robust transcription and turn-taking prediction
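The bullets above describe fusing acoustic and visual cues (speech activity, gaze) for turn-boundary detection. As a purely illustrative sketch of the idea, one could score turn boundaries by late fusion of the two cues; note the paper trains fusion end-to-end within a multimodal network, so the weighted combination, thresholds, and function names below are all assumptions, not the authors' method:

```python
# Illustrative late-fusion sketch (NOT the paper's implementation):
# combine an audio-derived silence score with a visual gaze cue to
# score whether the target speaker is yielding the conversational turn.

def fuse_turn_score(audio_silence, gaze_on_agent, w_audio=0.6, w_visual=0.4):
    """Weighted late fusion of two per-frame cues in [0, 1].

    audio_silence: probability the target speaker is currently silent
    gaze_on_agent: probability the speaker's gaze is directed at the agent
    The fixed weights are hypothetical; AV-Dialog learns fusion end-to-end.
    """
    return w_audio * audio_silence + w_visual * gaze_on_agent


def detect_turn_boundary(frames, threshold=0.5, min_frames=3):
    """Declare a turn boundary after `min_frames` consecutive frames
    whose fused score exceeds `threshold`; return that frame index,
    or None if no boundary is found."""
    streak = 0
    for i, (silence, gaze) in enumerate(frames):
        if fuse_turn_score(silence, gaze) > threshold:
            streak += 1
            if streak >= min_frames:
                return i  # frame where the boundary is confirmed
        else:
            streak = 0  # cue dropped below threshold; reset the run
    return None
```

A streak requirement like `min_frames` is one simple way to avoid firing on brief pauses mid-utterance; the paper instead grounds boundary detection semantically, which a heuristic like this cannot capture.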