🤖 AI Summary
This work addresses core challenges in transitioning spoken dialogue systems from cascaded ASR/NLU pipelines to end-to-end multimodal architectures. Method: We propose a speech-native large language model (LLM) framework integrating audio-adapted LLMs, cross-modal alignment mechanisms, joint speech-text pretraining, streaming inference, and post-ASR correction, with improved robustness to multilingual accents. Contribution/Results: (1) A unified architectural perspective bridging industrial voice assistants and open-domain/task-oriented agents; (2) Release of reproducible baselines, a system-level development roadmap, and standardized evaluation protocols; (3) Comprehensive curation of key datasets and explicit identification of open challenges, including privacy, safety, and rigorous multimodal evaluation. Our framework advances spoken dialogue systems toward next-generation architectures that are more robust, scalable, and inherently multimodal.
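To make the adapter idea concrete, the sketch below shows the common pattern behind audio-adapted LLMs and cross-modal alignment: a frozen speech encoder produces frame-level features, a lightweight projector maps them into the text LLM's embedding space, and the LLM consumes the projected frames as a soft prompt alongside ordinary token embeddings. This is a minimal illustration under assumed names and dimensions, not the paper's implementation; `AudioAdapter`, the feature sizes, and the downsampling stride are all placeholders.

```python
# Minimal sketch (assumptions, not the authors' code) of the audio-adapter
# pattern: project frozen speech-encoder features into an LLM's embedding
# space so the LLM can attend to audio as a soft prompt.
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Maps speech-encoder frames into the LLM embedding space."""
    def __init__(self, speech_dim=1024, llm_dim=4096, stride=4):
        super().__init__()
        self.stride = stride  # temporal downsampling shortens the audio sequence
        self.proj = nn.Sequential(
            nn.Linear(speech_dim * stride, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats):  # (batch, frames, speech_dim)
        b, t, d = speech_feats.shape
        t = (t // self.stride) * self.stride            # drop ragged tail frames
        x = speech_feats[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.proj(x)                             # (batch, frames/stride, llm_dim)

# Usage: prepend projected audio frames to the text token embeddings and run
# the (frozen or LoRA-tuned) LLM on the concatenated sequence as usual.
adapter = AudioAdapter()
audio = torch.randn(2, 96, 1024)        # stand-in for speech-encoder output
text_embeds = torch.randn(2, 16, 4096)  # stand-in for token embeddings
inputs = torch.cat([adapter(audio), text_embeds], dim=1)
print(inputs.shape)  # torch.Size([2, 40, 4096])
```

The stride is the usual lever for keeping long audio sequences within the LLM's context budget; joint speech-text pretraining then trains the projector (and optionally the LLM) on paired audio-text data.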
📝 Abstract
Spoken conversational agents are converging toward voice-native LLMs. This tutorial distills the path from cascaded ASR/NLU to end-to-end, retrieval- and vision-grounded systems. We frame adaptation of text LLMs to audio, cross-modal alignment, and joint speech-text training; review datasets, metrics, and robustness across accents; and compare design choices (cascaded vs. E2E, post-ASR correction, streaming). We link industrial assistants to current open-domain and task-oriented agents, highlight reproducible baselines, and outline open problems in privacy, safety, and evaluation. Attendees leave with practical recipes and a clear systems-level roadmap.
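As a concrete companion to the post-ASR correction design point, here is a minimal, hypothetical sketch of prompt-based n-best rewriting with a text LLM. The `generate` argument stands in for any LLM completion call and is not a specific API; the prompt wording is illustrative only.

```python
# Sketch of prompt-based post-ASR correction: feed the recognizer's n-best
# hypotheses to a text LLM and ask for a single corrected transcript.
# `generate` is a hypothetical stand-in for any completion function.
def correct_asr(hypotheses, generate):
    """Rewrite noisy ASR output; `hypotheses` is an n-best list of strings."""
    prompt = (
        "The following are candidate transcripts of one utterance from a "
        "speech recognizer. Output the single most likely intended sentence, "
        "fixing recognition errors only.\n"
        + "\n".join(f"{i + 1}. {h}" for i, h in enumerate(hypotheses))
        + "\nCorrected:"
    )
    return generate(prompt).strip()

# Example with a trivial placeholder "model" so the snippet runs standalone:
print(correct_asr(
    ["recognize speech with a beach", "wreck a nice speech with a beach"],
    generate=lambda p: "recognize speech with ease",  # placeholder output
))
```

In a cascaded system this step sits between ASR and NLU; in an E2E system the same correction capacity is absorbed into the speech-native LLM itself, which is one axis of the cascaded-vs-E2E comparison above.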