🤖 AI Summary
To address the information loss, error propagation, and high latency inherent in cascaded "ASR + LLM + TTS" architectures for voice interaction, this paper presents the first comprehensive survey of end-to-end Speech Language Models (SpeechLMs) -- models that understand and generate speech directly, without routing through text. The survey details the key components of SpeechLM architectures and the training recipes used to build them, systematically catalogs SpeechLM capabilities, categorizes their evaluation metrics, and discusses open challenges and future research directions. A curated open-source resource repository is maintained on GitHub, providing both a conceptual map and practical infrastructure for SpeechLM research.
📝 Abstract
Large Language Models (LLMs) have recently garnered significant attention, primarily for their capabilities in text-based interactions. However, natural human interaction often relies on speech, necessitating a shift towards voice-based models. A straightforward way to achieve this is a pipeline of "Automatic Speech Recognition (ASR) + LLM + Text-to-Speech (TTS)", where input speech is transcribed to text, processed by an LLM, and then converted back to speech. Despite its simplicity, this method suffers from inherent limitations, such as information loss during modality conversion, significant latency due to the multi-stage pipeline, and error accumulation across the three stages. To address these issues, Speech Language Models (SpeechLMs) -- end-to-end models that generate speech without converting from text -- have emerged as a promising alternative. This survey paper provides the first comprehensive overview of recent methodologies for constructing SpeechLMs, detailing the key components of their architecture and the various training recipes integral to their development. Additionally, we systematically survey the various capabilities of SpeechLMs, categorize their evaluation metrics, and discuss the challenges and future research directions in this rapidly evolving field. The GitHub repository is available at https://github.com/dreamtheater123/Awesome-SpeechLM-Survey
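The cascaded pipeline the abstract describes can be sketched as three chained stages. The functions below are hypothetical stand-ins (not any real model's API), but the composition makes the abstract's critique concrete: every stage adds latency, each hand-off is a lossy modality conversion, and an ASR error is carried unchanged into the LLM and TTS stages.

```python
# Minimal sketch of the cascaded "ASR + LLM + TTS" pipeline.
# asr(), llm(), and tts() are hypothetical stubs for illustration only.

def asr(audio: bytes) -> str:
    """Hypothetical speech-to-text stage: tone, emotion, and other
    paralinguistic cues in the audio are discarded here."""
    return "what is the weather today"

def llm(prompt: str) -> str:
    """Hypothetical text-only LLM stage: any transcription error in
    `prompt` propagates directly into the response."""
    return f"Response to: {prompt}"

def tts(text: str) -> bytes:
    """Hypothetical text-to-speech stage: output prosody is synthesized
    from text alone, independent of the original speaker's delivery."""
    return text.encode("utf-8")

def cascaded_pipeline(audio: bytes) -> bytes:
    # Two full modality conversions (speech -> text, text -> speech);
    # total latency is the sum of all three sequential stages.
    transcript = asr(audio)
    reply_text = llm(transcript)
    return tts(reply_text)
```

A SpeechLM collapses these three stages into a single end-to-end model, removing the intermediate text bottleneck that this sketch makes explicit.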