🤖 AI Summary
Addressing the dual challenges of whispered speech recognition and multi-dialect robustness under low-resource conditions, this paper proposes the first lightweight, efficient automatic speech recognition (ASR) framework based on the Mamba state-space model. Our method fuses fine-tuned acoustic features from four self-supervised models—Wav2Vec 2.0, WavLM, HuBERT, and Whisper—and leverages Mamba’s capability to model long-range temporal dependencies. To our knowledge, this is the first approach enabling joint recognition of whispered and canonical speech while seamlessly adapting to Singaporean, American, and Irish English dialects. Evaluated on wTIMIT and CHAINS, our framework achieves state-of-the-art performance, significantly reducing computational overhead and reliance on whispered speech annotations—requiring only a small number of whispered samples—while preserving strong generalization across domains and dialects. The complete implementation is open-sourced.
📝 Abstract
Whispered speech recognition presents significant challenges for conventional automatic speech recognition systems, particularly when combined with dialect variation. An efficient method that addresses this problem with limited training data and low processing cost is therefore desirable. This paper proposes a solution using a Mamba-based state-space model together with four fine-tuned self-supervised models (Wav2Vec 2.0, WavLM, HuBERT, and Whisper) to address the dual challenges of whispered speech and dialect diversity. To the best of our knowledge, this represents the best performance reported on the wTIMIT and CHAINS datasets for whispered speech recognition. We trained the models on whispered and normal speech data across Singaporean, US, and Irish dialects. The findings demonstrate that the proposed Mamba-based model is highly efficient, requiring only small amounts of whispered training data while recognizing both whispered and normal speech. The code for this work is freely available.
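The pipeline described above fuses frame-level features from several self-supervised encoders and passes them through a state-space sequence model. A minimal, purely illustrative sketch of that idea follows; the encoder stubs, function names, and dimensions are assumptions for demonstration, not the authors' implementation, and a real system would use the pretrained SSL models and a selective-scan Mamba block.

```python
# Illustrative sketch of the fusion + state-space idea (not the paper's code).
# Each "encoder" stub stands in for a pretrained SSL model (Wav2Vec 2.0,
# WavLM, HuBERT, Whisper); real systems would emit learned frame embeddings.

def encoder_stub(audio, dim):
    """Produce a toy per-frame embedding of width `dim` for each sample."""
    return [[x * (i + 1) for i in range(dim)] for x in audio]

def fuse(frame_feats):
    """Concatenate the per-frame features from all encoders."""
    return [sum(frames, []) for frames in zip(*frame_feats)]

def ssm_scan(xs, a=0.9):
    """Minimal linear state-space recurrence: h_t = a*h_{t-1} + x_t.
    A Mamba block uses input-dependent (selective) parameters; a fixed
    decay `a` is enough to show how the state carries long-range context."""
    h = [0.0] * len(xs[0])
    ys = []
    for x in xs:
        h = [a * hi + xi for hi, xi in zip(h, x)]
        ys.append(h)
    return ys

audio = [0.1, -0.2, 0.3]                            # toy waveform, 3 frames
feats = [encoder_stub(audio, 2) for _ in range(4)]  # four SSL encoders
fused = fuse(feats)                                 # 3 frames x 8 dims
out = ssm_scan(fused)                               # 3 frames of state output
```

Because the recurrence carries a decaying state across all previous frames, each output depends on the entire history with constant memory per step, which is the property Mamba exploits for long-range temporal modeling at low cost.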