When Large Language Models Meet Speech: A Survey on Integration Approaches

📅 2025-02-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the technical bottlenecks in multimodal speech understanding and generation by systematically surveying recent advances in integrating large language models (LLMs) with speech modalities. It proposes three unified integration paradigms (text-driven, latent-representation-driven, and audio-token-driven) and introduces the first structured taxonomy covering 200+ works, clarifying the evolutionary logic of paradigm shifts. It analyzes three core challenges: cross-modal alignment, low-latency inference, and robust generalization. Key technical pathways are synthesized, including ASR post-processing, joint fine-tuning of speech encoders and LLMs, neural audio codecs, multi-stage prompt engineering, and modality adapters. The work provides a systematic methodology for speech-LLM architecture design, benchmark development, and domain-specific deployment.

📝 Abstract
Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. A large number of studies have explored integrating other modalities with LLMs, notably the speech modality, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based, latent-representation-based, and audio-token-based integration. We also demonstrate how these methods are applied across various speech-related applications and highlight the challenges in this field to offer inspiration for
Problem

Research questions and friction points this paper is trying to address.

Integration of speech with LLMs
Methodologies for speech-LLM integration
Challenges in speech-LLM applications
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates speech with language models
Uses text, latent, and audio tokens
Explores multimodal LLM applications
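
To make the three integration approaches named above more concrete, here is a minimal, dependency-free Python sketch. It is an illustration under stated assumptions, not code from the surveyed paper: the function names, `EMBED_DIM`, `TEXT_VOCAB`, the linear adapter, and the toy inputs are all hypothetical stand-ins for a real ASR system, speech encoder, and neural audio codec.

```python
from typing import Callable, List

EMBED_DIM = 4     # hypothetical LLM embedding width (assumption)
TEXT_VOCAB = 100  # hypothetical text vocabulary size (assumption)


def text_based(audio: List[float], asr: Callable[[List[float]], str]) -> str:
    """Text-based: transcribe speech first, then pass plain text to the LLM."""
    return "Transcript: " + asr(audio)


def latent_based(frames: List[List[float]],
                 weights: List[List[float]]) -> List[List[float]]:
    """Latent-representation-based: a linear adapter (one weight row per
    output dimension) maps each speech-encoder frame into the LLM
    embedding space."""
    return [
        [sum(x * w for x, w in zip(frame, row)) for row in weights]
        for frame in frames
    ]


def audio_token_based(codec_ids: List[int],
                      text_vocab: int = TEXT_VOCAB) -> List[int]:
    """Audio-token-based: shift discrete codec ids past the text vocabulary
    so speech and text tokens can share one sequence."""
    return [text_vocab + i for i in codec_ids]


if __name__ == "__main__":
    # Toy ASR stub that ignores the waveform (assumption for illustration).
    print(text_based([0.1, -0.2], lambda _audio: "hello"))

    # One 2-dimensional encoder frame projected to EMBED_DIM = 4 dimensions.
    adapter = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    print(latent_based([[1.0, 2.0]], adapter))

    # Two codec ids mapped into an extended vocabulary.
    print(audio_token_based([0, 5]))
```

The sketch only shows where speech enters the model in each case: as text in the prompt, as continuous vectors in the embedding space, or as extra discrete tokens in the vocabulary. Real systems would replace each stub with a trained component.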