AVE Speech Dataset: A Comprehensive Benchmark for Multi-Modal Speech Recognition Integrating Audio, Visual, and Electromyographic Signals

📅 2025-01-28
🤖 AI Summary
To address the limited robustness of automatic speech recognition (ASR) for elderly individuals with hearing and speech impairments in noisy environments and cross-speaker scenarios, this work introduces the first large-scale, sentence-level, multimodal ASR benchmark for Mandarin Chinese. The dataset comprises synchronized audio, lip-motion video, and six-channel electromyographic (EMG) signals from 100 participants, with over 55 hours of data per modality, designed specifically for high-noise and cross-subject non-acoustic ASR. It is the first publicly available sentence-level Mandarin dataset integrating acoustic, visual, and EMG modalities, and is accompanied by a custom synchronized multimodal acquisition system, an EMG feature extraction pipeline, a lip-video encoding framework, and an end-to-end cross-modal fusion modeling approach. Experiments show that the proposed multimodal joint model significantly improves recognition accuracy, outperforming unimodal baselines under severe noise and in cross-subject conditions, which validates the complementary value of EMG and visual cues in acoustically degraded settings.
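The summary mentions an end-to-end cross-modal fusion model but gives no architectural detail. As a purely illustrative sketch (the mean-pooling encoders and all feature dimensions are assumptions, not the paper's method; only the six EMG channels and the 100-sentence corpus come from the dataset description), late fusion of the three modalities can be as simple as concatenating per-modality embeddings before classifying over the sentence inventory:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(frames: np.ndarray) -> np.ndarray:
    """Toy per-modality encoder: mean-pool feature frames over time, (T, D) -> (D,)."""
    return frames.mean(axis=0)

# Hypothetical per-sentence inputs; frame counts and feature sizes are illustrative.
audio = rng.standard_normal((200, 39))   # e.g. 39-dim acoustic features over 200 frames
video = rng.standard_normal((50, 128))   # e.g. 128-dim lip-region embeddings over 50 frames
emg   = rng.standard_normal((400, 6))    # six-channel EMG, as in the dataset

# Late fusion by feature concatenation: (39 + 128 + 6,) = (173,)
fused = np.concatenate([encode(audio), encode(video), encode(emg)])

# Linear classifier over the 100-sentence corpus (weights random here, for shape only).
W = rng.standard_normal((100, fused.shape[0]))
logits = W @ fused
pred = int(np.argmax(logits))
print(fused.shape, pred)
```

In a trained system the pooled encoders would be learned networks and the fusion could occur earlier (e.g. on frame-level features), but the concatenate-then-classify shape of late fusion is the same.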

📝 Abstract
The global aging population faces considerable challenges, particularly in communication, due to the prevalence of hearing and speech impairments. To address these challenges, we introduce the AVE Speech dataset, a comprehensive multi-modal benchmark for speech recognition tasks. The dataset includes a 100-sentence Mandarin Chinese corpus with audio signals, lip-region video recordings, and six-channel electromyography (EMG) data, collected from 100 participants. Each subject read the entire corpus ten times, with each sentence averaging approximately two seconds in duration, resulting in over 55 hours of multi-modal speech data per modality. Experiments demonstrate that combining these modalities significantly improves recognition performance, particularly in cross-subject and high-noise environments. To our knowledge, this is the first publicly available sentence-level dataset integrating these three modalities for large-scale Mandarin speech recognition. We expect this dataset to drive advancements in both acoustic and non-acoustic speech recognition research, enhancing cross-modal learning and human-machine interaction.
Problem

Research questions and friction points this paper is trying to address.

Speech Recognition
Noisy Environment
Elderly Communication
Innovation

Methods, ideas, or system contributions that make the work stand out.

AVE speech dataset
multimodal speech recognition
enhanced accuracy
Dongliang Zhou
Defense Innovation Institute, Academy of Military Sciences, Beijing, China; Tianjin Artificial Intelligence Innovation Center, Tianjin, China; and Harbin Institute of Technology, Shenzhen, China
Yakun Zhang
Harbin Institute of Technology, Shenzhen
Jinghan Wu
Tianjin University, Tianjin, China
Xingyu Zhang
Horizon Robotics Inc.
Liang Xie
Wuhan University of Technology
Erwei Yin
Defense Innovation Institute, Academy of Military Sciences, Beijing, China; and Tianjin Artificial Intelligence Innovation Center, Tianjin, China