🤖 AI Summary
The field of spoken language models (SLMs) suffers from terminological inconsistency, fragmented paradigms, and an unclear developmental trajectory.
Method: We conduct the first systematic literature review on SLMs, constructing a structured knowledge graph covering publications from 2020 to 2025. Our synthesis integrates speech encoders (e.g., Whisper), text-based LMs (e.g., LLaMA), and multimodal alignment techniques, and introduces the first unified SLM taxonomy, distinguishing *pure speech sequence modeling* from *speech-encoder–text-LM fusion*, while clarifying their architectural designs, training strategies, and evaluation protocols.
Contribution: This work establishes a consensus-oriented analytical framework for the field, identifies three core challenges (robustness, cross-lingual generalization, and low-resource adaptation), and provides foundational theory and a research roadmap guiding the evolution of SLMs from task-specific models toward general-purpose spoken language processing systems.
📝 Abstract
The field of spoken language processing is undergoing a shift from training custom-built, task-specific models toward using and optimizing spoken language models (SLMs), which act as universal speech processing systems. This trend is similar to the progression toward universal language models that has taken place in the field of (text) natural language processing. SLMs include both "pure" language models of speech -- models of the distribution of tokenized speech sequences -- and models that combine speech encoders with text language models, often including both spoken and written input or output. Work in this area is very diverse, with a range of terminology and evaluation settings. This paper aims to contribute an improved understanding of SLMs via a unifying literature survey of recent work in the context of the evolution of the field. Our survey categorizes the work in this area by model architecture, training, and evaluation choices, and describes some key challenges and directions for future work.