🤖 AI Summary
Conversational systems often produce irrelevant responses and handle turn-taking poorly in noisy, multi-speaker environments. To address this, we propose the first end-to-end audio-visual fusion framework for spoken dialogue modeling. It jointly leverages acoustic signals and visual cues (including gaze direction and speech activity) to track the target speaker, produce streaming speaker-aware transcription, detect semantically aligned turn boundaries, and generate coherent responses. Our approach introduces a multi-task, multi-stage training paradigm that couples acoustic tokenization with a multimodal fusion network, jointly optimized across single-speaker, synthetic, and real-world audio-visual datasets. Experiments demonstrate significant reductions in transcription error rates, more accurate turn-switch prediction, and higher human-rated dialogue fluency. These results validate the critical role of audio-visual synergy in building robust, natural conversational systems.
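To make the fusion design concrete, below is a minimal sketch (not the authors' implementation) of how discrete acoustic tokens and time-aligned visual cue features could be fused by a shared encoder feeding three task heads: transcription, turn-boundary detection, and response generation. All module names, dimensions, and the PyTorch framing are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch of an audio-visual fusion model with multi-task heads.
# Assumes discrete acoustic tokens (e.g. from a neural audio tokenizer) and
# per-frame visual cue features (e.g. gaze / speech-activity embeddings) that
# are already time-aligned. Everything here is a hypothetical stand-in.
import torch
import torch.nn as nn


class AVFusionModel(nn.Module):
    def __init__(self, n_audio_tokens=1024, visual_dim=128, d_model=256,
                 vocab_size=5000):
        super().__init__()
        # Embed discrete acoustic tokens into the shared model dimension.
        self.audio_embed = nn.Embedding(n_audio_tokens, d_model)
        # Project per-frame visual cues into the same dimension.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Fuse the two time-aligned streams with a shared Transformer encoder.
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=4)
        # Task heads: streaming speaker-aware transcription, turn-boundary
        # detection (hold vs. switch), and response-token prediction.
        self.asr_head = nn.Linear(d_model, vocab_size)
        self.turn_head = nn.Linear(d_model, 2)
        self.response_head = nn.Linear(d_model, vocab_size)

    def forward(self, audio_tokens, visual_feats):
        # audio_tokens: (batch, T) int64; visual_feats: (batch, T, visual_dim).
        x = self.audio_embed(audio_tokens) + self.visual_proj(visual_feats)
        h = self.fusion(x)
        return self.asr_head(h), self.turn_head(h), self.response_head(h)


# Smoke test with random inputs.
model = AVFusionModel()
audio = torch.randint(0, 1024, (2, 50))
visual = torch.randn(2, 50, 128)
asr, turn, resp = model(audio, visual)
print(asr.shape, turn.shape, resp.shape)
```

Summing the embedded streams before a shared encoder is only one plausible fusion choice; cross-attention between modalities would be an equally reasonable variant.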
📝 Abstract
Dialogue models falter in noisy, multi-speaker environments, often producing irrelevant responses and awkward turn-taking. We present AV-Dialog, the first multimodal dialogue framework that uses both audio and visual cues to track the target speaker, predict turn-taking, and generate coherent responses. By combining acoustic tokenization with multi-task, multi-stage training on monadic, synthetic, and real audio-visual dialogue datasets, AV-Dialog achieves robust streaming transcription, semantically grounded turn-boundary detection, and accurate responses, resulting in natural conversational flow. Experiments show that AV-Dialog outperforms audio-only models under interference, reducing transcription errors, improving turn-taking prediction, and enhancing human-rated dialogue quality. These results highlight the power of seeing as well as hearing for speaker-aware interaction, paving the way for spoken dialogue agents that perform robustly in real-world, noisy environments.
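The multi-task, multi-stage training the abstract describes could look roughly like the sketch below, which reuses the hypothetical `AVFusionModel` from the earlier example. The stage ordering (monadic, then synthetic, then real), the loss weights, and the batch keys are all assumptions for illustration, not the paper's actual recipe.

```python
# Hypothetical multi-stage training loop with a weighted multi-task loss:
# train first on monadic (single-speaker) data, then synthetic multi-speaker
# mixtures, then real audio-visual dialogue. Weights and keys are invented.
import torch
import torch.nn.functional as F


def multitask_loss(asr_logits, turn_logits, resp_logits, batch,
                   w_asr=1.0, w_turn=0.5, w_resp=1.0):
    # Per-frame cross-entropy for each task head, combined with fixed weights.
    l_asr = F.cross_entropy(asr_logits.flatten(0, 1), batch["text"].flatten())
    l_turn = F.cross_entropy(turn_logits.flatten(0, 1), batch["turn"].flatten())
    l_resp = F.cross_entropy(resp_logits.flatten(0, 1), batch["resp"].flatten())
    return w_asr * l_asr + w_turn * l_turn + w_resp * l_resp


def train_stages(model, stage_loaders, lr=1e-4):
    # stage_loaders: ordered [(stage_name, iterable_of_batches), ...].
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for stage_name, loader in stage_loaders:
        for batch in loader:
            asr, turn, resp = model(batch["audio"], batch["visual"])
            loss = multitask_loss(asr, turn, resp, batch)
            opt.zero_grad()
            loss.backward()
            opt.step()


# Example: one tiny "stage" with a single random batch.
model = AVFusionModel()
batch = {
    "audio": torch.randint(0, 1024, (2, 50)),
    "visual": torch.randn(2, 50, 128),
    "text": torch.randint(0, 5000, (2, 50)),
    "turn": torch.randint(0, 2, (2, 50)),
    "resp": torch.randint(0, 5000, (2, 50)),
}
train_stages(model, [("synthetic", [batch])])
```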