AISHELL6-whisper: A Chinese Mandarin Audio-visual Whisper Speech Dataset with Speech Recognition Baselines

📅 2025-09-28
🤖 AI Summary
Chinese audio-visual whispered speech recognition suffers from a severe lack of large-scale, annotated data. To address this gap, we introduce AISHELL6-Whisper—the first open-source, large-scale Chinese audio-visual whispered speech dataset, comprising 30 hours of parallel whispered/normal speech paired with synchronized frontal facial videos. We propose a Whisper-Flamingo–based baseline model for audio-visual whispered speech recognition, featuring two key innovations: (1) a parallel multimodal embedding alignment strategy to jointly model audio and visual cues, and (2) a whisper-spectrum adaptive projection layer to enhance spectral representation fidelity. The model is trained end-to-end. On the AISHELL6-Whisper test set, it achieves character error rates of 4.13% for whispered speech and 1.11% for normal speech. Furthermore, on the wTIMIT benchmark, it establishes a new state-of-the-art, empirically demonstrating the critical contribution of lip-motion information to whispered speech recognition.

📝 Abstract
Whisper speech recognition is crucial not only for ensuring privacy in sensitive communications but also for providing a critical communication bridge for patients under vocal restraint and for enabling discreet interaction in noise-sensitive environments. The development of Chinese Mandarin audio-visual whisper speech recognition is hindered by the lack of large-scale datasets. We present AISHELL6-Whisper, a large-scale open-source audio-visual whisper speech dataset, featuring 30 hours each of whisper speech and parallel normal speech, with synchronized frontal facial videos. Moreover, we propose an audio-visual speech recognition (AVSR) baseline based on the Whisper-Flamingo framework, which integrates a parallel training strategy to align embeddings across speech types and employs a projection layer to adapt to whisper speech's spectral properties. The model achieves a Character Error Rate (CER) of 4.13% for whisper speech and 1.11% for normal speech on the test set of our dataset, and establishes new state-of-the-art results on the wTIMIT benchmark. The dataset and the AVSR baseline code are open-sourced at https://zutm.github.io/AISHELL6-Whisper.
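The reported CER figures (4.13% for whispered speech, 1.11% for normal speech) use the standard definition: the character-level edit distance between hypothesis and reference, divided by the reference length. A minimal self-contained sketch (function names are illustrative, not taken from the paper's codebase):

```python
def edit_distance(ref: str, hyp: str) -> int:
    # Classic dynamic-programming Levenshtein distance over characters,
    # using a single rolling row to keep memory at O(len(hyp)).
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref: str, hyp: str) -> float:
    # Character Error Rate: edits needed to turn hyp into ref,
    # normalized by the reference length.
    return edit_distance(ref, hyp) / len(ref)
```

For Mandarin, Python's per-character iteration over a string maps directly onto per-character scoring, which is why CER (rather than word error rate) is the conventional metric.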
Problem

Research questions and friction points this paper is trying to address.

Lack of large-scale Chinese Mandarin whisper speech datasets
Absence of audio-visual whisper speech recognition baselines
Spectral mismatch between whisper speech and normal speech
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset with synchronized audio-visual whisper speech
AVSR baseline using Whisper-Flamingo framework
Parallel training strategy aligning embeddings across speech types
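The parallel training strategy is only described at a high level here; one plausible reading is that, because every whispered utterance has a parallel normal-speech recording, the encoder embeddings of the two renditions can be pulled together with an auxiliary alignment loss. A minimal numpy sketch under that assumption (the function name and the plain MSE choice are hypothetical, not confirmed by the paper):

```python
import numpy as np

def embedding_alignment_loss(whisper_emb: np.ndarray,
                             normal_emb: np.ndarray) -> float:
    # Hypothetical auxiliary loss: mean-squared error between the
    # frame-level encoder embeddings of a parallel whispered/normal
    # utterance pair, both of shape [frames, dim]. Assumes the two
    # sequences have already been time-aligned to equal length.
    assert whisper_emb.shape == normal_emb.shape
    return float(np.mean((whisper_emb - normal_emb) ** 2))
```

In practice such a term would be added to the recognition loss with a weighting coefficient, encouraging the encoder to map whispered and normal renditions of the same text to nearby representations.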
Authors
Cancan Li, School of Computer Science, Wuhan University, Wuhan, China
Fei Su, School of Computer Science, Wuhan University, Wuhan, China
Juan Liu, Wuhan University (interests: Data Mining, Artificial Intelligence in Bioinformatics, Biomedicine)
Hui Bu, aishell (interests: Speech Recognition, Speech databases and text corpora, Special topics on speech databases and)
Yulong Wan, AI Center, OPPO, Beijing, China
Hongbin Suo, AI Center, OPPO, Beijing, China
Ming Li, School of Artificial Intelligence, Wuhan University, Wuhan, China