DuplexMamba: Enhancing Real-time Speech Conversations with Duplex and Streaming Capabilities

📅 2025-02-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Traditional Transformer-based dialogue systems suffer from quadratic computational complexity in autoregressive decoding with respect to input length, hindering real-time, full-duplex speech interaction. To address this, we propose an end-to-end multimodal duplex speech dialogue system. Our method introduces: (1) a novel duplex autoregressive decoding mechanism enabling synchronous, streaming processing of speech input and text output; (2) the first deep adaptation of the Mamba state-space model for joint speech encoding and language modeling, replacing computationally expensive attention mechanisms; and (3) a streaming speech encoder integrated within an end-to-end jointly trained framework. Experiments on ASR and virtual assistant benchmarks demonstrate that our system achieves accuracy comparable to state-of-the-art Transformer models while reducing inference latency by 57%, significantly enhancing real-time responsiveness and interaction naturalness.

📝 Abstract

Real-time speech conversation is essential for natural and efficient human-machine interactions, requiring duplex and streaming capabilities. Traditional Transformer-based conversational chatbots operate in a turn-based manner and exhibit quadratic computational complexity that grows as the input size increases. In this paper, we propose DuplexMamba, a Mamba-based end-to-end multimodal duplex model for speech-to-text conversation. DuplexMamba enables simultaneous input processing and output generation, dynamically adjusting to support real-time streaming. Specifically, we develop a Mamba-based speech encoder and adapt it with a Mamba-based language model. Furthermore, we introduce a novel duplex decoding strategy that enables DuplexMamba to process input and generate output simultaneously. Experimental results demonstrate that DuplexMamba successfully implements duplex and streaming capabilities while achieving performance comparable to several recently developed Transformer-based models in automatic speech recognition (ASR) tasks and voice assistant benchmark evaluations.
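The complexity contrast in the abstract can be illustrated with a minimal sketch. It assumes a scalar, non-selective linear state-space recurrence with made-up constants `a`, `b`, and `c`. Mamba's actual layers use input-dependent (selective) parameters and vector-valued states, so this is a simplified stand-in and not the DuplexMamba implementation. `StreamingSSM` and `attention_pair_count` are hypothetical names chosen for illustration.

```python
from typing import List


class StreamingSSM:
    """Toy scalar state-space recurrence (illustrative constants, not from the paper).

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    """

    def __init__(self, a: float = 0.9, b: float = 0.5, c: float = 2.0) -> None:
        self.a, self.b, self.c = a, b, c
        self.h = 0.0  # the only memory carried between steps

    def step(self, x: float) -> float:
        # Constant work per input, no matter how much input came before.
        self.h = self.a * self.h + self.b * x
        return self.c * self.h

    def run(self, xs: List[float]) -> List[float]:
        return [self.step(x) for x in xs]


def attention_pair_count(n: int) -> int:
    """Query-key pairs scored by causal self-attention over n tokens."""
    return n * (n + 1) // 2


# Streaming in chunks matches processing the whole sequence at once,
# because the fixed-size state h summarizes everything seen so far.
signal = [1.0, 0.0, 0.0, 2.0, -1.0]
full = StreamingSSM().run(signal)
chunked_model = StreamingSSM()
chunked = chunked_model.run(signal[:2]) + chunked_model.run(signal[2:])
```

The recurrence does one fixed-size update per input, so total work grows linearly with sequence length, whereas `attention_pair_count(n)` grows on the order of n squared.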

Problem

Research questions and friction points this paper is trying to address.

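Real-time speech conversation requires duplex and streaming capabilities

Transformer-based chatbots are turn-based, and their computational cost grows quadratically with input size

Input processing and output generation need to happen simultaneously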

Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba-based end-to-end model

Simultaneous input-output processing

Novel duplex decoding strategy
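As a rough illustration of simultaneous input-output processing, the toy loop below consumes one input chunk and emits at most one output token per tick. Everything in it is an assumption made for illustration: the function name `duplex_loop`, the `respond` callback, and the policy of discarding the unfinished reply when new input arrives. The summary does not describe the paper's actual decoding rules, so this should not be read as DuplexMamba's strategy.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple


def duplex_loop(
    chunks: Iterable[Optional[str]],
    respond: Callable[[List[str]], List[str]],
) -> Iterator[Tuple[Optional[str], Optional[str]]]:
    """Interleave listening and speaking, one tick at a time.

    chunks:  per-tick input; None means no new input on that tick.
    respond: hypothetical callback that maps everything heard so far
             to a list of output tokens.
    """
    heard: List[str] = []
    pending: List[str] = []
    for chunk in chunks:
        if chunk is not None:
            heard.append(chunk)
            # Assumed policy: new input replaces the unfinished reply.
            pending = list(respond(heard))
        # Emit one token per tick while still accepting input.
        out = pending.pop(0) if pending else None
        yield chunk, out


def toy_respond(heard: List[str]) -> List[str]:
    # Hypothetical stand-in for a model: a three-token reply to the latest input.
    return [f"re:{heard[-1]}", "a", "b"]


timeline = list(duplex_loop(["hi", None, "stop", None, None, None], toy_respond))
```

In this run the reply to "hi" is cut off after two tokens when "stop" arrives, which is the kind of behavior a turn-based loop cannot express because it reads all input before producing any output.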
