🤖 AI Summary
To address the need for real-time, intelligent, and natural spoken dialogue, this paper proposes LLaMA-Omni 2—a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters for high-quality real-time speech interaction. Methodologically, it builds on the Qwen2.5 model series, integrating a speech encoder and an autoregressive streaming speech decoder so that speech understanding, language generation, and speech synthesis are handled within a single pipeline—trained on only 200K multi-turn speech dialogue samples, substantially reducing data requirements. Key contributions include: (1) a unified speech-to-speech architecture that couples low-latency streaming speech synthesis with high response quality; and (2) strong performance on spoken question answering and speech instruction-following benchmarks, surpassing previous state-of-the-art SpeechLMs such as GLM-4-Voice, which was trained on millions of hours of speech data. Experiments demonstrate superior efficiency and effectiveness at a much smaller training scale, pointing toward lightweight, high-performance spoken interaction.
📝 Abstract
Real-time, intelligent, and natural speech interaction is an essential part of next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction-following benchmarks, surpassing previous state-of-the-art SpeechLMs such as GLM-4-Voice, which was trained on millions of hours of speech data.
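The abstract describes a speech-in/speech-out pipeline: a speech encoder feeds an LLM backbone, and a streaming autoregressive decoder emits speech incrementally as text tokens are produced. The toy sketch below only illustrates that data flow; every class, token value, and function name is an illustrative placeholder, not the authors' actual implementation.

```python
# Illustrative sketch of the pipeline in the abstract:
# speech encoder -> LLM backbone -> streaming autoregressive speech decoder.
# All components are stubs; real systems use learned neural modules.

class SpeechEncoder:
    def encode(self, waveform):
        # Map raw audio frames to feature tokens (stubbed with integer hashes).
        return [hash(frame) % 97 for frame in waveform]


class LLMBackbone:
    def generate_text_tokens(self, speech_features):
        # Autoregressively generate response tokens conditioned on the
        # encoded speech input (stubbed as a trivial transform).
        return [t + 1 for t in speech_features]


class StreamingSpeechDecoder:
    def stream(self, text_tokens):
        # Yield speech units one at a time as text tokens arrive, so audio
        # synthesis can begin before the full text response is complete --
        # this is what enables low-latency streaming interaction.
        for t in text_tokens:
            yield t * 2  # placeholder "speech unit"


def spoken_dialogue_turn(waveform):
    features = SpeechEncoder().encode(waveform)
    text_tokens = LLMBackbone().generate_text_tokens(features)
    # list(...) consumes the stream here for demonstration; a real system
    # would play each unit back as soon as it is yielded.
    return list(StreamingSpeechDecoder().stream(text_tokens))


units = spoken_dialogue_turn(["frame0", "frame1", "frame2"])
print(len(units))  # one speech unit per input frame in this toy sketch
```

The key design point the sketch highlights is the generator in `StreamingSpeechDecoder`: because units are yielded incrementally rather than after full decoding, playback latency is bounded by the first token rather than the whole response.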