🤖 AI Summary
This paper addresses the technical bottlenecks in multimodal speech understanding and generation by systematically surveying recent advances in integrating large language models (LLMs) with the speech modality. It proposes three unified integration paradigms (text-based, latent-representation-based, and audio-token-based), introduces the first structured taxonomy covering 200+ works, and traces the logic behind the field's shifts between these paradigms. Three core challenges are rigorously analyzed: cross-modal alignment, low-latency inference, and robust generalization. Key technical pathways are synthesized, including ASR post-processing, joint fine-tuning of speech encoders and LLMs, neural audio codecs, multi-stage prompt engineering, and modality adapters. The work provides a systematic methodology for speech-LLM architecture design, benchmark development, and domain-specific deployment.
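To make the latent-representation-based paradigm concrete, here is a minimal sketch of a modality adapter: speech-encoder features are stacked for temporal downsampling and linearly projected into the LLM's embedding space, so the LLM can consume them like ordinary token embeddings. All dimensions, names, and the NumPy stand-in for a trained projection are illustrative assumptions, not the method of any specific paper in the survey.

```python
import numpy as np

# Hypothetical dimensions: a speech encoder emits 512-d frames;
# the LLM expects 768-d token embeddings. STACK frames are merged
# per "speech token" to shorten the sequence 4x.
ENC_DIM, LLM_DIM, STACK = 512, 768, 4

rng = np.random.default_rng(0)
# Adapter weights are random here; in practice they are learned,
# often while jointly fine-tuning the speech encoder and the LLM.
W = rng.standard_normal((ENC_DIM * STACK, LLM_DIM)) * 0.02
b = np.zeros(LLM_DIM)

def adapt(speech_feats: np.ndarray) -> np.ndarray:
    """Project encoder features into the LLM embedding space.

    Stacks STACK consecutive frames (temporal downsampling) and
    applies a linear projection.
    """
    t = speech_feats.shape[0] // STACK * STACK           # drop remainder frames
    stacked = speech_feats[:t].reshape(-1, ENC_DIM * STACK)
    return stacked @ W + b                                # (t // STACK, LLM_DIM)

feats = rng.standard_normal((103, ENC_DIM))   # ~1 s of encoder output
embeds = adapt(feats)
print(embeds.shape)                           # (25, 768)
```

The resulting sequence of projected vectors would be concatenated with text-prompt embeddings before being fed to the LLM; the audio-token-based paradigm instead discretizes the audio (e.g. with a neural codec) so the LLM operates on discrete speech tokens directly.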
📝 Abstract
Recent advancements in large language models (LLMs) have spurred interest in expanding their application beyond text-based tasks. A large number of studies have explored integrating other modalities with LLMs, notably the speech modality, which is naturally related to text. This paper surveys the integration of speech with LLMs, categorizing the methodologies into three primary approaches: text-based, latent-representation-based, and audio-token-based integration. We also demonstrate how these methods are applied across various speech-related applications and highlight the challenges in this field to offer inspiration for future research.