AI Summary
Traditional Transformer-based dialogue systems incur a computational cost during autoregressive decoding that grows quadratically with input length, which hinders real-time, full-duplex speech interaction. To address this, we propose an end-to-end multimodal duplex speech dialogue system. Our method introduces: (1) a novel duplex autoregressive decoding mechanism that processes streaming speech input and generates text output at the same time; (2) the first deep adaptation of the Mamba state-space model for joint speech encoding and language modeling, replacing computationally expensive attention mechanisms; and (3) a streaming speech encoder integrated within an end-to-end jointly trained framework. Experiments on ASR and voice assistant benchmarks show that our system achieves accuracy comparable to state-of-the-art Transformer models while reducing inference latency by 57%, improving real-time responsiveness and the naturalness of interaction.
Abstract
Real-time speech conversation is essential for natural and efficient human-machine interaction, and it requires both duplex and streaming capabilities. Traditional Transformer-based conversational chatbots operate in a turn-based manner and incur a computational cost that grows quadratically with input length. In this paper, we propose DuplexMamba, a Mamba-based end-to-end multimodal duplex model for speech-to-text conversation. DuplexMamba enables simultaneous input processing and output generation, dynamically adjusting to support real-time streaming. Specifically, we develop a Mamba-based speech encoder and adapt it with a Mamba-based language model. Furthermore, we introduce a novel duplex decoding strategy that enables DuplexMamba to process input and generate output simultaneously. Experimental results demonstrate that DuplexMamba successfully implements duplex and streaming capabilities while achieving performance comparable to several recently developed Transformer-based models on automatic speech recognition (ASR) tasks and voice assistant benchmark evaluations.
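To make the streaming argument concrete, the following is a minimal sketch of a generic diagonal linear state-space recurrence, not DuplexMamba's actual encoder or decoder. The class name `DiagonalSSM`, the helper `make_layer`, and all parameter values are hypothetical and chosen only for illustration. Unlike Mamba, whose parameters are input-dependent (selective), this toy layer uses fixed parameters. It shows the property the abstract relies on: the layer carries a fixed-size state, so each new input costs the same amount of work, and processing audio chunk by chunk gives the same outputs as processing the whole sequence at once.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DiagonalSSM:
    """Toy diagonal linear state-space layer (hypothetical, for illustration).

    Recurrence: h_t = a * h_{t-1} + b * x_t,  y_t = sum(c * h_t).
    Parameters are fixed here; Mamba makes them depend on the input.
    """
    a: List[float]
    b: List[float]
    c: List[float]
    state: List[float] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Start from a zero state with one entry per state dimension.
        if not self.state:
            self.state = [0.0] * len(self.a)

    def step(self, x: float) -> float:
        # Constant work per input: only the fixed-size state is updated.
        self.state = [a * h + b * x for a, b, h in zip(self.a, self.b, self.state)]
        return sum(c * h for c, h in zip(self.c, self.state))

    def process_chunk(self, chunk: List[float]) -> List[float]:
        # The state persists across calls, so chunks can arrive one at a time.
        return [self.step(x) for x in chunk]


def make_layer() -> DiagonalSSM:
    # Arbitrary illustrative parameters (assumption, not from the paper).
    return DiagonalSSM(a=[0.9, 0.5], b=[1.0, 0.25], c=[0.5, 2.0])


signal = [1.0, 0.0, -1.0, 2.0, 0.5, 0.0]

# Offline: the whole sequence is available up front.
offline = make_layer().process_chunk(signal)

# Streaming: the same sequence arrives in chunks of two samples.
streaming_layer = make_layer()
streamed: List[float] = []
for start in range(0, len(signal), 2):
    streamed.extend(streaming_layer.process_chunk(signal[start:start + 2]))

# Both paths run the same recurrence in the same order.
assert streamed == offline
```

By contrast, a self-attention layer attends over all previous positions at every step, so its total cost grows quadratically with sequence length. That is the overhead the abstract attributes to Transformer-based chatbots.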