DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities

📅 2025-02-16
🤖 AI Summary
Traditional Transformer-based dialogue systems suffer from quadratic computational complexity in autoregressive decoding with respect to input length, hindering real-time, full-duplex speech interaction. To address this, we propose an end-to-end multimodal duplex speech dialogue system. Our method introduces: (1) a novel duplex autoregressive decoding mechanism enabling synchronous, streaming processing of speech input and text output; (2) the first deep adaptation of the Mamba state-space model for joint speech encoding and language modeling, replacing computationally expensive attention mechanisms; and (3) a streaming speech encoder integrated within an end-to-end jointly trained framework. Experiments on ASR and virtual assistant benchmarks demonstrate that our system achieves accuracy comparable to state-of-the-art Transformer models while reducing inference latency by 57%, significantly enhancing real-time responsiveness and interaction naturalness.

📝 Abstract
Real-time speech conversation is essential for natural and efficient human-machine interactions, requiring duplex and streaming capabilities. Traditional Transformer-based conversational chatbots operate in a turn-based manner and exhibit quadratic computational complexity that grows as the input size increases. In this paper, we propose DuplexMamba, a Mamba-based end-to-end multimodal duplex model for speech-to-text conversation. DuplexMamba enables simultaneous input processing and output generation, dynamically adjusting to support real-time streaming. Specifically, we develop a Mamba-based speech encoder and adapt it with a Mamba-based language model. Furthermore, we introduce a novel duplex decoding strategy that enables DuplexMamba to process input and generate output simultaneously. Experimental results demonstrate that DuplexMamba successfully implements duplex and streaming capabilities while achieving performance comparable to several recently developed Transformer-based models in automatic speech recognition (ASR) tasks and voice assistant benchmark evaluations.
Problem

Research questions and friction points this paper is trying to address.

Enhances real-time speech conversations
Reduces quadratic computational complexity
Implements duplex and streaming capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba-based end-to-end model
Simultaneous input-output processing
Novel duplex decoding strategy
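
The efficiency argument behind a state-space backbone can be illustrated with a small sketch. The code below is not DuplexMamba's implementation and is not Mamba's selective scan; it is a generic, time-invariant linear state-space recurrence with hypothetical names (`ssm_step`, `run_stream`) and made-up parameters. It only shows that such a model can consume input chunk by chunk while carrying a fixed-size state, so per-step cost does not grow with the amount of input already seen.

```python
import numpy as np

def ssm_step(state, x_t, A, B, C):
    """One recurrence step: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    state = A @ state + B * x_t
    return state, float(C @ state)

def run_stream(chunks, A, B, C):
    """Consume input chunk by chunk, carrying only a fixed-size state."""
    state = np.zeros(A.shape[0])
    outputs = []
    for chunk in chunks:
        for x_t in chunk:
            state, y_t = ssm_step(state, x_t, A, B, C)
            outputs.append(y_t)
    return outputs

# Illustrative parameters (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
n = 4                       # state size
A = 0.9 * np.eye(n)         # stable state transition
B = rng.normal(size=n)      # input projection
C = rng.normal(size=n)      # output projection
signal = rng.normal(size=12)

# Processing the whole signal at once or as four streamed chunks
# produces the same outputs, because only the state is carried over.
full = run_stream([signal], A, B, C)
streamed = run_stream(np.split(signal, 4), A, B, C)
print(np.allclose(full, streamed))
```

An attention layer, by contrast, revisits every earlier position at each new step, which is where the quadratic cost described in the abstract comes from.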
Authors

Xiangyu Lu
Harbin Institute of Technology
Wang Xu
Harbin Institute of Technology
natural language processing, artificial intelligence
Haoyu Wang
Tsinghua University, Beijing, China
Hongyun Zhou
Harbin Institute of Technology Master
PEFT, Machine translation, LLM
Haiyan Zhao
Peking University
Conghui Zhu
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Tiejun Zhao
Faculty of Computing, Harbin Institute of Technology, Harbin, China
Muyun Yang
Faculty of Computing, Harbin Institute of Technology, Harbin, China