OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

📅 2024-10-23
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
📄 PDF
🤖 AI Summary
Existing speech dialogue systems struggle to emulate natural human full-duplex interaction, such as interjections, speech overlaps, and immediate turn-taking. Method: This paper proposes OmniFlatten, an end-to-end full-duplex speech dialogue system that requires no architectural modification to the GPT backbone. A three-stage post-training scheme progressively adapts a text-only large language model into a speech-text dialogue model: (i) modality alignment, (ii) half-duplex dialogue learning, and (iii) full-duplex dialogue learning. Across all stages, a "flattening" operation serializes speech-text joint representations into a single token stream, unifying the training method and the GPT backbone across modalities and tasks and enabling real-time bidirectional speech input and output. Contribution/Results: Experiments demonstrate reduced end-to-end latency, improved speech naturalness and interaction fluency, and high-quality synchronous interaction, all while preserving the backbone architecture. Audio samples of generated dialogues are publicly released.
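
To make the "flattening" idea concrete, here is a minimal, hypothetical sketch (not the paper's code) of how parallel speech/text token streams might be serialized into the single sequence a standard GPT backbone consumes. The chunk size, padding id, stream order, and the `flatten_streams` helper are all illustrative assumptions.

```python
# Hypothetical sketch of the "flattening" operation: chunks of tokens from
# several parallel streams (e.g., user speech, assistant speech, assistant
# text) are interleaved into one flat sequence, so an unmodified GPT
# backbone can model all streams autoregressively.

from typing import List

CHUNK = 4  # illustrative chunk length in tokens
PAD = 0    # illustrative padding/silence token id


def flatten_streams(streams: List[List[int]], chunk: int = CHUNK) -> List[int]:
    """Interleave fixed-size chunks from each stream into one sequence.

    All streams are padded to a common multiple of the chunk size so that
    chunk boundaries stay aligned across streams.
    """
    longest = max(len(s) for s in streams)
    total = ((longest + chunk - 1) // chunk) * chunk  # round up to chunk
    padded = [s + [PAD] * (total - len(s)) for s in streams]

    flat: List[int] = []
    for start in range(0, total, chunk):
        for s in padded:
            flat.extend(s[start:start + chunk])
    return flat


# Example: two token streams (say, user speech and assistant speech).
user_speech = [101, 102, 103, 104, 105, 106]
asst_speech = [201, 202, 203, 204]
print(flatten_streams([user_speech, asst_speech]))
# -> [101, 102, 103, 104, 201, 202, 203, 204, 105, 106, 0, 0, 0, 0, 0, 0]
```

Because the flattened sequence is ordinary next-token-prediction data, the same training loop and GPT backbone can be reused in every stage, which is what the summary means by a unified training method.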

📝 Abstract
Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human interactions. However, achieving low latency and natural interactions in full-duplex dialogue systems remains a significant challenge, especially considering human conversation dynamics such as interruptions, backchannels, and overlapping speech. In this paper, we introduce a novel End-to-End GPT-based model OmniFlatten for full-duplex conversation, capable of effectively modeling the complex behaviors inherent to natural conversations with low latency. To achieve full-duplex conversation capabilities, we propose a multi-stage post-training scheme that progressively adapts a text large language model (LLM) backbone into a speech-text dialogue LLM, capable of generating text and speech in real time, without modifying the architecture of the backbone LLM. The training process comprises three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. In all training stages, we standardize the data using a flattening operation, which enables unifying the training methods and the GPT backbone across different modalities and tasks. Our approach offers a simple modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. Audio samples of dialogues generated by OmniFlatten can be found at this web site (https://omniflatten.github.io/).
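
The abstract leaves the real-time scheduling implicit. Below is a hypothetical, self-contained sketch (not the paper's code) of one chunk-level reading of full-duplex decoding: the flattened history alternates between a chunk of incoming user tokens and a chunk of outgoing assistant tokens, so the model keeps "listening" while it "speaks". The token id ranges, chunk size, silence id, and the `fake_generate_chunk` stand-in for the LLM are all invented for illustration.

```python
# Hypothetical chunk-level full-duplex loop: at each step the model first
# ingests a chunk of user speech tokens, then emits a chunk of assistant
# tokens. Emitting silence tokens lets it "listen" without taking the turn.

CHUNK = 4      # illustrative chunk length
SILENCE = 0    # illustrative token id emitted while listening


def fake_generate_chunk(history, n_tokens):
    """Stand-in for the GPT backbone: stays silent until two full chunks
    of user tokens (ids 100-199 by convention here) have arrived."""
    user_tokens = [t for t in history if 100 <= t < 200]
    if len(user_tokens) <= 2 * CHUNK:
        return [SILENCE] * n_tokens            # keep listening
    return [900 + i for i in range(n_tokens)]  # fake speech-token reply


def full_duplex_loop(user_token_chunks):
    history = []   # the single flattened sequence fed to the backbone
    outputs = []
    for user_chunk in user_token_chunks:
        history.extend(user_chunk)                        # listen
        asst_chunk = fake_generate_chunk(history, CHUNK)  # speak
        history.extend(asst_chunk)
        outputs.append(asst_chunk)
    return outputs


# Three chunks of user speech arrive; the assistant stays silent for two
# chunks, then starts replying, all within one interleaved sequence.
chunks = [[101, 102, 103, 104], [105, 106, 107, 108], [109, 110, 111, 112]]
print(full_duplex_loop(chunks))
# -> [[0, 0, 0, 0], [0, 0, 0, 0], [900, 901, 902, 903]]
```

Under this reading, worst-case response latency is bounded by the chunk duration rather than by a full user turn, which is consistent with the low-latency claim in the abstract.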
Problem

Research questions and friction points this paper is trying to address.

Full-duplex conversation
Natural language processing
Speech interaction
Innovation

Methods, ideas, or system contributions that make the work stand out.

OmniFlatten
Tri-Phase Training Strategy
Full-Duplex Conversation Enhancement
👥 Authors
Qinglin Zhang, Tongyi Lab
Luyao Cheng, Tongyi Lab
Chong Deng, Alibaba Group (machine learning, natural language processing)
Qian Chen, Tongyi Lab
Wen Wang, Tongyi Lab
Siqi Zheng, Tongyi Lab
Jiaqing Liu, Renmin University of China (Natural Language Processing, Deep Learning, Machine Learning, Finance)
Hai Yu, Nankai University (Robotics, Nonlinear Control)
Chaohong Tan, Tongyi Lab