Spoken Conversational Agents with Large Language Models

📅 2025-12-02
🏛️ Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Tutorial Abstracts
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses core challenges in transitioning speech dialogue systems from cascaded ASR/NLU pipelines to end-to-end multimodal architectures. Method: We propose a speech-native large language model (LLM) framework integrating audio-adapted LLMs, cross-modal alignment mechanisms, joint speech-text pretraining, streaming inference, post-ASR correction, and improved robustness to multilingual and accented speech. Contribution/Results: (1) A unified architectural perspective bridging industrial voice assistants and open-domain/task-oriented agents; (2) Release of reproducible baselines, a systems-level development roadmap, and standardized evaluation protocols; (3) Comprehensive curation of key datasets and explicit identification of open challenges, including privacy, safety, and rigorous multimodal evaluation. Our framework advances spoken dialogue systems toward next-generation architectures that are more robust, scalable, and inherently multimodal.
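
A minimal sketch of the cross-modal alignment step, under assumed module names and dimensions: frame-level states from a pretrained audio encoder are downsampled and projected into the text LLM's embedding space so speech and text share one decoder. This is an illustrative pattern, not the tutorial's reference implementation.

```python
# Illustrative sketch (PyTorch): project audio encoder states into an LLM's
# embedding space. Module names, dimensions, and the prepending strategy
# below are assumptions for exposition only.
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Maps frame-level audio encoder states to LLM token-embedding space."""
    def __init__(self, audio_dim: int = 1024, llm_dim: int = 4096, stride: int = 4):
        super().__init__()
        # Temporal downsampling reduces the audio frame rate before projection.
        self.downsample = nn.Conv1d(audio_dim, audio_dim, kernel_size=stride, stride=stride)
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_states: torch.Tensor) -> torch.Tensor:
        # audio_states: (batch, frames, audio_dim)
        x = self.downsample(audio_states.transpose(1, 2)).transpose(1, 2)
        return self.proj(x)  # (batch, frames // stride, llm_dim)

# In joint speech-text training, projected audio embeddings are typically
# concatenated or interleaved with text token embeddings before the LLM, e.g.:
#   inputs = torch.cat([adapter(audio_states), llm.embed_tokens(text_ids)], dim=1)
#   logits = llm(inputs_embeds=inputs).logits
```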

📝 Abstract
Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents; and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.
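
As a concrete illustration of the post-ASR correction design point, the sketch below has a text LLM rewrite an ASR n-best list into a corrected transcript. The model checkpoint, prompt wording, and decoding settings are placeholder assumptions, not the tutorial's recipe.

```python
# Illustrative post-ASR correction: feed ASR hypotheses to a text LLM and let
# it rewrite likely recognition errors. Model and prompt are placeholders.
from transformers import pipeline

corrector = pipeline("text-generation", model="gpt2")  # placeholder checkpoint

def correct_hypotheses(asr_nbest: list[str]) -> str:
    prompt = (
        "The following are ASR hypotheses of the same utterance. "
        "Rewrite the most likely intended sentence, fixing recognition errors.\n"
        + "\n".join(f"{i + 1}. {h}" for i, h in enumerate(asr_nbest))
        + "\nCorrected:"
    )
    out = corrector(prompt, max_new_tokens=40, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()

print(correct_hypotheses(["i sent the male yesterday", "i sent the mail yesterday"]))
```
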
Problem

Research questions and friction points this paper is trying to address.

Adapting text LLMs to spoken conversational agents
Aligning cross-modal data for audio and vision integration
Addressing robustness across accents and system design choices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adapting text LLMs to audio and cross-modal alignment
Comparing cascaded vs. end-to-end design choices and streaming (a streaming sketch follows this list)
Providing reproducible baselines and systems-level roadmap
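
A minimal sketch of the streaming-inference pattern referenced above: audio is consumed in fixed-size chunks, encoder state is updated incrementally, and partial hypotheses are emitted before the utterance ends. The chunk size and the encode_chunk/decode_partial callables are stand-ins for whatever streaming model is actually used.

```python
# Illustrative streaming loop: incremental encoding with partial outputs.
# CHUNK_SECONDS and the two callables are assumptions for exposition.
import numpy as np

CHUNK_SECONDS = 0.5
SAMPLE_RATE = 16_000

def stream_decode(audio: np.ndarray, encode_chunk, decode_partial) -> str:
    """encode_chunk(chunk, state) -> state; decode_partial(state) -> str."""
    state, partials = None, []
    chunk_len = int(CHUNK_SECONDS * SAMPLE_RATE)
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start:start + chunk_len]
        state = encode_chunk(chunk, state)        # update encoder / cache state
        partials.append(decode_partial(state))    # emit an updated partial result
    return partials[-1] if partials else ""
```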