AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines

📅 2025-09-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Chinese audio-visual whispered speech recognition is held back by a severe lack of large-scale annotated data. To address this gap, we introduce AISHELL6-Whisper, the first open-source, large-scale Chinese Mandarin audio-visual whispered speech dataset, comprising 30 hours of whispered speech and 30 hours of parallel normal speech, each paired with synchronized frontal facial videos. We propose a Whisper-Flamingo-based baseline model for audio-visual whispered speech recognition with two key additions: (1) a parallel training strategy that aligns embeddings between whispered and normal speech, and (2) a projection layer that adapts the model to the spectral properties of whispered speech. The model is trained end-to-end. On the AISHELL6-Whisper test set, it achieves character error rates (CER) of 4.13% for whispered speech and 1.11% for normal speech. It also establishes new state-of-the-art results on the wTIMIT benchmark.

📝 Abstract

Whisper speech recognition is crucial not only for ensuring privacy in sensitive communications but also for providing a critical communication bridge for patients under vocal restraint and enabling discreet interaction in noise-sensitive environments. The development of Chinese Mandarin audio-visual whisper speech recognition is hindered by the lack of large-scale datasets. We present AISHELL6-Whisper, a large-scale open-source audio-visual whisper speech dataset, featuring 30 hours each of whisper speech and parallel normal speech, with synchronized frontal facial videos. Moreover, we propose an audio-visual speech recognition (AVSR) baseline based on the Whisper-Flamingo framework, which integrates a parallel training strategy to align embeddings across speech types and employs a projection layer to adapt to the spectral properties of whisper speech. The model achieves a Character Error Rate (CER) of 4.13% for whisper speech and 1.11% for normal speech on the test set of our dataset, and establishes new state-of-the-art results on the wTIMIT benchmark. The dataset and the AVSR baseline code are open-sourced at https://zutm.github.io/AISHELL6-Whisper.
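
The results above are reported as Character Error Rate. As a point of reference, CER is conventionally computed as the character-level edit distance (substitutions, insertions, deletions) between hypothesis and reference, divided by the total number of reference characters. The sketch below illustrates that standard definition only; the function names `edit_distance` and `cer` are illustrative, and it makes no claim about the scoring script or text normalization the authors actually used.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted per character."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level CER: total edit errors / total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_errors / total_chars


if __name__ == "__main__":
    # Hypothetical example: one substituted character in a six-character reference.
    refs = ["今天天气很好"]
    hyps = ["今天天汽很好"]
    print(f"CER = {cer(refs, hyps):.4f}")
```

Because Mandarin text has no whitespace word boundaries, scoring per character avoids depending on a word segmenter, which is why CER rather than word error rate is the usual metric for Chinese speech recognition.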

Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale Chinese Mandarin whisper speech datasets

Developing audio-visual whisper speech recognition systems

Addressing spectral differences between whisper and normal speech

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset with synchronized audio-visual whisper speech

AVSR baseline using Whisper-Flamingo framework

Parallel training strategy aligning embeddings across speech types
