🤖 AI Summary
This paper addresses the technical bottlenecks in multimodal speech understanding and generation by systematically surveying recent advances in integrating large language models (LLMs) with the speech modality. It proposes three unified integration paradigms (text-based, latent-representation-based, and audio-token-based), introduces the first structured taxonomy covering 200+ works, and traces the logic behind the field's shifts between these paradigms. Three core challenges are rigorously analyzed: cross-modal alignment, low-latency inference, and robust generalization. Key technical pathways are synthesized, including ASR post-processing, joint fine-tuning of speech encoders and LLMs, neural audio codecs, multi-stage prompt engineering, and modality adapters. The work provides a systematic methodology for speech-LLM architecture design, benchmark development, and domain-specific deployment.
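To make the latent-representation-based paradigm concrete, here is a minimal sketch of a modality adapter: speech-encoder features are stacked for temporal downsampling and linearly projected into the LLM's embedding space, so the LLM can consume them like ordinary token embeddings. All dimensions, names, and the NumPy stand-in for a trained projection are illustrative assumptions, not the method of any specific paper in the survey.

```python
import numpy as np

# Hypothetical dimensions: a speech encoder emits 512-d frames;
# the LLM expects 768-d token embeddings. STACK frames are merged
# per "speech token" to shorten the sequence 4x.
ENC_DIM, LLM_DIM, STACK = 512, 768, 4

rng = np.random.default_rng(0)
# Adapter weights are random here; in practice they are learned,
# often while jointly fine-tuning the speech encoder and the LLM.
W = rng.standard_normal((ENC_DIM * STACK, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

def adapt(speech_feats: np.ndarray) -> np.ndarray:
    """Project encoder features into the LLM embedding space.

    Stacks STACK consecutive frames (temporal downsampling) and
    applies a linear projection.
    """
    t = speech_feats.shape[0] // STACK * STACK           # drop remainder frames
    stacked = speech_feats[:t].reshape(-1, ENC_DIM * STACK)
    return stacked @ W + b                                # (t // STACK, LLM_DIM)

feats = rng.standard_normal((103, ENC_DIM))   # ~1 s of encoder output
embeds = adapt(feats)
print(embeds.shape)                           # (25, 768)
```

The resulting sequence of projected vectors would be concatenated with text-prompt embeddings before being fed to the LLM; the audio-token-based paradigm instead discretizes the audio (e.g. with a neural codec) so the LLM operates on discrete speech tokens directly.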
📝 Abstract
Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. A large number of studies have explored integrating other modalities with LLMs, notably the speech modality, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based, latent-representation-based, and audio-token-based integration. We also demonstrate how these methods are applied across various speech-related applications and highlight the challenges in this field to offer inspiration for future research.