Mamba for Streaming ASR Combined with Unimodal Aggregation

πŸ“… 2024-09-30
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 1
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
To address the trade-off between accuracy and latency in streaming Chinese automatic speech recognition (ASR), this paper proposes an efficient Mamba-based encoder. We introduce streaming unimodal aggregation (UMA)β€”the first of its kindβ€”along with a complementary early termination (ET) mechanism, and design a controllable lookahead modeling strategy to enable dynamic token activation detection and progressive output generation. This work marks the first integration of the Mamba state space model into streaming ASR, including full end-to-end adaptation. Evaluated on two Chinese benchmark datasets, our approach achieves accuracy comparable to Transformer-based models while significantly reducing end-to-end latency. Experimental results demonstrate that linear-complexity state space models offer both efficacy and practicality for real-time speech recognition.

Technology Category

Application Category

πŸ“ Abstract
This paper works on streaming automatic speech recognition (ASR). Mamba, a recently proposed state space model, has demonstrated the ability to match or surpass Transformers in various tasks while benefiting from a linear complexity advantage. We explore the efficiency of Mamba encoder for streaming ASR and propose an associated lookahead mechanism for leveraging controllable future information. Additionally, a streaming-style unimodal aggregation (UMA) method is implemented, which automatically detects token activity and streamingly triggers token output, and meanwhile aggregates feature frames for better learning token representation. Based on UMA, an early termination (ET) method is proposed to further reduce recognition latency. Experiments conducted on two Mandarin Chinese datasets demonstrate that the proposed model achieves competitive ASR performance in terms of both recognition accuracy and latency.
Problem

Research questions and friction points this paper is trying to address.

Real-time speech recognition
Accuracy improvement
Computational efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba ASR system
Future-aware mechanism
UMA and ET methods for efficiency
πŸ”Ž Similar Papers
No similar papers found.
Ying Fang
Ying Fang
Westlake University; Zhejiang University
speech recognition
X
Xiaofei Li
School of Engineering, Westlake University, China; Institute of Advanced Technology, Westlake Institute for Advanced Study, China