🤖 AI Summary
To address the need for real-time, intelligent, and natural spoken dialogue, this paper proposes LLaMA-Omni 2—a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters for high-quality real-time speech interaction. Methodologically, it builds on the Qwen2.5 model series, integrating a speech encoder and an autoregressive streaming speech decoder so that speech understanding, language generation, and speech synthesis are handled within a single pipeline—trained on only 200K multi-turn speech dialogue samples, substantially reducing data requirements. Key contributions include: (1) a unified speech-to-speech architecture that couples low-latency streaming speech synthesis with high response quality; and (2) strong performance on spoken question answering and speech instruction-following benchmarks, surpassing previous state-of-the-art SpeechLMs such as GLM-4-Voice, which was trained on millions of hours of speech data. Experiments demonstrate superior efficiency and effectiveness at a much smaller training scale, pointing toward lightweight, high-performance spoken interaction.
📝 Abstract
Real-time, intelligent, and natural speech interaction is an essential part of next-generation human-computer interaction. Recent advancements have showcased the potential of building intelligent spoken chatbots based on large language models (LLMs). In this paper, we introduce LLaMA-Omni 2, a series of speech language models (SpeechLMs) ranging from 0.5B to 14B parameters, capable of achieving high-quality real-time speech interaction. LLaMA-Omni 2 is built upon the Qwen2.5 series models, integrating a speech encoder and an autoregressive streaming speech decoder. Despite being trained on only 200K multi-turn speech dialogue samples, LLaMA-Omni 2 demonstrates strong performance on several spoken question answering and speech instruction-following benchmarks, surpassing previous state-of-the-art SpeechLMs such as GLM-4-Voice, which was trained on millions of hours of speech data.
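The abstract describes a speech-in/speech-out pipeline: a speech encoder feeds an LLM backbone, and a streaming autoregressive decoder emits speech incrementally as text tokens are produced. The toy sketch below only illustrates that data flow; every class, token value, and function name is an illustrative placeholder, not the authors' actual implementation.

```python
# Illustrative sketch of the pipeline in the abstract:
# speech encoder -> LLM backbone -> streaming autoregressive speech decoder.
# All components are stubs; real systems use learned neural modules.

class SpeechEncoder:
    def encode(self, waveform):
        # Map raw audio frames to feature tokens (stubbed with integer hashes).
        return [hash(frame) % 97 for frame in waveform]


class LLMBackbone:
    def generate_text_tokens(self, speech_features):
        # Autoregressively generate response tokens conditioned on the
        # encoded speech input (stubbed as a trivial transform).
        return [t + 1 for t in speech_features]


class StreamingSpeechDecoder:
    def stream(self, text_tokens):
        # Yield speech units one at a time as text tokens arrive, so audio
        # synthesis can begin before the full text response is complete --
        # this is what enables low-latency streaming interaction.
        for t in text_tokens:
            yield t * 2  # placeholder "speech unit"


def spoken_dialogue_turn(waveform):
    features = SpeechEncoder().encode(waveform)
    text_tokens = LLMBackbone().generate_text_tokens(features)
    # list(...) consumes the stream here for demonstration; a real system
    # would play each unit back as soon as it is yielded.
    return list(StreamingSpeechDecoder().stream(text_tokens))


units = spoken_dialogue_turn(["frame0", "frame1", "frame2"])
print(len(units))  # one speech unit per input frame in this toy sketch
```

The key design point the sketch highlights is the generator in `StreamingSpeechDecoder`: because units are yielded incrementally rather than after full decoding, playback latency is bounded by the first token rather than the whole response.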