🤖 AI Summary
This work addresses core challenges in transitioning spoken dialogue systems from cascaded ASR/NLU pipelines to end-to-end multimodal architectures. Method: We propose a speech-native large language model (LLM) framework integrating audio-adapted LLMs, cross-modal alignment mechanisms, joint speech-text pretraining, streaming inference, and post-ASR correction, with improved robustness to multilingual accents. Contribution/Results: (1) A unified architectural perspective bridging industrial voice assistants and open-domain/task-oriented agents; (2) Release of reproducible baselines, a system-level development roadmap, and standardized evaluation protocols; (3) Comprehensive curation of key datasets and explicit identification of open challenges, including privacy, safety, and rigorous multimodal evaluation. Our framework advances spoken dialogue systems toward next-generation architectures that are more robust, scalable, and inherently multimodal.
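To make the adapter idea concrete, the sketch below shows the common pattern behind audio-adapted LLMs and cross-modal alignment: a frozen speech encoder produces frame-level features, a lightweight projector maps them into the text LLM's embedding space, and the LLM consumes the projected frames as a soft prompt alongside ordinary token embeddings. This is a minimal illustration under assumed names and dimensions, not the paper's implementation; `AudioAdapter`, the feature sizes, and the downsampling stride are all placeholders.

```python
# Minimal sketch (assumptions, not the authors' code) of the audio-adapter
# pattern: project frozen speech-encoder features into an LLM's embedding
# space so the LLM can attend to audio as a soft prompt.
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Maps speech-encoder frames into the LLM embedding space."""
    def __init__(self, speech_dim=1024, llm_dim=4096, stride=4):
        super().__init__()
        self.stride = stride  # temporal downsampling shortens the audio sequence
        self.proj = nn.Sequential(
            nn.Linear(speech_dim * stride, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats):  # (batch, frames, speech_dim)
        b, t, d = speech_feats.shape
        t = (t // self.stride) * self.stride            # drop ragged tail frames
        x = speech_feats[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.proj(x)                             # (batch, frames/stride, llm_dim)

# Usage: prepend projected audio frames to the text token embeddings and run
# the (frozen or LoRA-tuned) LLM on the concatenated sequence as usual.
adapter = AudioAdapter()
audio = torch.randn(2, 96, 1024)        # stand-in for speech-encoder output
text_embeds = torch.randn(2, 16, 4096)  # stand-in for token embeddings
inputs = torch.cat([adapter(audio), text_embeds], dim=1)
print(inputs.shape)  # torch.Size([2, 40, 4096])
```

The stride is the usual lever for keeping long audio sequences within the LLM's context budget; joint speech-text pretraining then trains the projector (and optionally the LLM) on paired audio-text data.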
📝 Abstract
Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents; and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.
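As a concrete companion to the post-ASR correction design point, here is a minimal, hypothetical sketch of prompt-based n-best rewriting with a text LLM. The `generate` argument stands in for any LLM completion call and is not a specific API; the prompt wording is illustrative only.

```python
# Sketch of prompt-based post-ASR correction: feed the recognizer's n-best
# hypotheses to a text LLM and ask for a single corrected transcript.
# `generate` is a hypothetical stand-in for any completion function.
def correct_asr(hypotheses, generate):
    """Rewrite noisy ASR output; `hypotheses` is an n-best list of strings."""
    prompt = (
        "The following are candidate transcripts of one utterance from a "
        "speech recognizer. Output the single most likely intended sentence, "
        "fixing recognition errors only.\n"
        + "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
        + "\nCorrected:"
    )
    return generate(prompt).strip()

# Example with a trivial placeholder "model" so the snippet runs standalone:
print(correct_asr(
    ["recognize speech with a beach", "wreck a nice speech with a beach"],
    generate=lambda p: "recognize speech with ease",  # placeholder output
))
```

In a cascaded system this step sits between ASR and NLU; in an E2E system the same correction capacity is absorbed into the speech-native LLM itself, which is one axis of the cascaded-vs-E2E comparison above.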