FLM-Audio: Natural Monologues Improves Native Full-Duplex Chatbots via Dual Training

📅 2025-09-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of aligning text monologues with audio streams that run at different sampling rates. It proposes the "natural monologue" modeling paradigm, abandoning word-level alignment that relies on high-precision token timestamps, thereby eliminating cascaded errors and preprocessing overhead. Methodologically, it introduces a two-stage training strategy that alternates the position of the monologue relative to the audio, enabling end-to-end multi-channel synchronization by directly modeling listen-speak coordination over continuous text sequences in a 7B-parameter spoken dialog model. Key contributions: (1) replacing forced temporal alignment with natural speech rhythm that emulates human cognitive behavior in dialog; and (2) improving temporal understanding and generation coordination via dual-stage training. Experiments show significantly reduced response latency, improved duplex interaction coherence, and a better subjective user experience, advancing full-duplex dialog systems toward practical deployment.

📝 Abstract
Full-duplex dialog models are designed to listen and speak simultaneously, responding rapidly to fast-changing user input. Among existing approaches, native full-duplex models merge different channels (e.g., listen and speak) in a single time step, overcoming the high response latency inherent to time-division multiplexing (TDM) alternatives. Yet a key challenge remains: aligning textual monologues with audio streams that operate at different bitrates. The prevailing solution relies on word-level alignment, but this can degrade the language ability of large pre-trained models. Moreover, it requires highly accurate timestamps for every token, which introduces cascading errors and increases pre-processing costs. In this paper, we propose textual monologues expressed as continuous token sequences, namely "natural" monologues, which mimic human-like cognitive behavior in dialogs. For temporal alignment, we alternate the position of the natural monologue - leading or trailing the audio - across different training stages. This "dual" training paradigm proves highly effective in building FLM-Audio, our 7B spoken dialog model, which demonstrates superior responsiveness, duplexity, and chatting experience, as confirmed by experimental results.
Problem

Research questions and friction points this paper is trying to address.

Aligning textual monologues with audio streams at different bitrates
Overcoming word-level alignment degradation of pre-trained models
Eliminating cascading errors from token timestamp requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Natural monologues as continuous token sequences
Dual training with alternating monologue positions
Native full-duplex model merging simultaneous channels
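The "dual" training idea described above can be sketched in data-layout terms: the same dialog turn is serialized with the text monologue placed before the audio tokens in one training stage and after them in the other. A minimal Python sketch follows; the token placeholders and the `build_sequence` helper are hypothetical illustrations, not names from the paper:

```python
# Hypothetical sketch of the "dual" training data layout (not the authors' code).
# Each dialog turn is one token stream; the natural-language monologue is
# placed either BEFORE the audio tokens (stage A) or AFTER them (stage B),
# so no per-word timestamps are needed.

def build_sequence(text_tokens, audio_tokens, monologue_leading):
    """Combine a monologue with its audio stream for one training stage.

    monologue_leading=True  -> text precedes audio (text plans the speech)
    monologue_leading=False -> text follows audio (text reflects what was heard)
    """
    if monologue_leading:
        return ["<mono>"] + text_tokens + ["</mono>"] + audio_tokens
    return audio_tokens + ["<mono>"] + text_tokens + ["</mono>"]

text = ["hello", "there"]            # placeholder text tokens
audio = ["<a1>", "<a2>", "<a3>"]     # placeholder audio tokens

stage_a = build_sequence(text, audio, monologue_leading=True)
stage_b = build_sequence(text, audio, monologue_leading=False)
```

Alternating the two layouts across stages is one way to let a model both generate speech guided by leading text and ground heard audio in trailing text, avoiding the cascading errors of timestamp-based word-level alignment.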
Yiqun Yao
Unknown affiliation
Xiang Li
Beijing Academy of Artificial Intelligence, Beijing, China
Xin Jiang
Beijing Academy of Artificial Intelligence, Beijing, China
Xuezhi Fang
Beijing Academy of Artificial Intelligence, Beijing, China
Naitong Yu
Beijing Academy of Artificial Intelligence
Large Language Models · Natural Language Processing · Artificial Intelligence
Wenjia Ma
Spin Matrix, China
Aixin Sun
Nanyang Technological University, Singapore
Yequan Wang
Beijing Academy of Artificial Intelligence, Beijing, China