Pseudo Labels-based Neural Speech Enhancement for the AVSR Task in the MISP-Meeting Challenge

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the degradation of automatic speech recognition (ASR) performance in far-field multi-speaker meeting scenarios, specifically those defined in MISP-Meeting Challenge Track 2, caused by strong background noise, reverberation, speech overlap, and topic diversity. To tackle these challenges, we propose a robust audio-visual speech recognition (AVSR) framework. Our key contributions are: (1) TLS, a pseudo-label generation framework combining time alignment, level alignment, and signal-to-noise-ratio (SNR) filtering to produce signal-level pseudo labels for real-recorded far-field audio; (2) G-SpatialNet, a speech enhancement model that refines Guided Source Separation (GSS) signals; and (3) fine-tuning strategies, data augmentation, and multimodal information that adapt pre-trained ASR models to meeting scenarios. Evaluated on the Dev and Eval sets, our system achieves character error rates (CERs) of 5.44% and 9.52%, respectively, relative improvements of 64.8% and 52.6% over the baseline, and secures second place in the challenge.

📝 Abstract
This paper presents our system for the MISP-Meeting Challenge Track 2. The primary difficulty lies in the dataset, which contains strong background noise, reverberation, overlapping speech, and diverse meeting topics. To address these issues, we (a) designed G-SpatialNet, a speech enhancement (SE) model to improve Guided Source Separation (GSS) signals; (b) proposed TLS, a framework comprising time alignment, level alignment, and signal-to-noise ratio filtering, to generate signal-level pseudo labels for real-recorded far-field audio data, thereby facilitating SE models' training; and (c) explored fine-tuning strategies, data augmentation, and multimodal information to enhance the performance of pre-trained Automatic Speech Recognition (ASR) models in meeting scenarios. Finally, our system achieved character error rates (CERs) of 5.44% and 9.52% on the Dev and Eval sets, respectively, with relative improvements of 64.8% and 52.6% over the baseline, securing second place.
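The baseline CERs are not stated in this summary, but they can be back-solved from the reported system CERs and relative improvements. A small sketch of that arithmetic (the back-solved baseline values are inferred, and rounding in the published percentages may shift the last digit):

```python
def implied_baseline_cer(system_cer, relative_improvement):
    """Relative improvement r = (baseline - system) / baseline,
    so baseline = system / (1 - r)."""
    return system_cer / (1.0 - relative_improvement)

# Dev:  5.44% CER with a 64.8% relative improvement -> baseline around 15.45% CER
# Eval: 9.52% CER with a 52.6% relative improvement -> baseline around 20.08% CER
dev_baseline = implied_baseline_cer(5.44, 0.648)
eval_baseline = implied_baseline_cer(9.52, 0.526)
```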
Problem

Research questions and friction points this paper is trying to address.

Enhancing speech in noisy, reverberant, overlapping meeting recordings
Generating pseudo labels for far-field audio to train SE models
Improving ASR performance in challenging meeting scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

G-SpatialNet enhances Guided Source Separation signals
TLS framework generates pseudo labels for training
Fine-tuning and multimodal data boost ASR performance
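The three TLS stages named in the abstract (time alignment, level alignment, SNR filtering) can be sketched minimally. This is an illustrative reconstruction, not the authors' implementation: the function names, the brute-force lag search, the least-squares gain, and the 5 dB threshold are all assumptions.

```python
import math

def time_align(ref, mix, max_lag=64):
    # Time alignment: estimate the integer delay of `mix` relative to `ref`
    # by brute-force cross-correlation over candidate lags (assumed method).
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = sum(ref[n] * mix[n + lag]
                    for n in range(len(ref))
                    if 0 <= n + lag < len(mix))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def level_align(ref, est):
    # Level alignment: least-squares gain that scales `est` onto `ref`.
    alpha = sum(r * e for r, e in zip(ref, est)) / (sum(e * e for e in est) + 1e-12)
    return [alpha * e for e in est]

def snr_db(ref, est):
    # SNR of `est` against `ref`, in dB.
    noise = sum((r - e) ** 2 for r, e in zip(ref, est))
    return 10 * math.log10(sum(r * r for r in ref) / (noise + 1e-12))

def keep_as_pseudo_label(ref, est, threshold_db=5.0):
    # SNR filtering: accept an enhanced segment as a signal-level pseudo label
    # only if, after alignment, it is close enough to the reference
    # (the 5 dB threshold is a placeholder, not from the paper).
    return snr_db(ref, level_align(ref, est)) >= threshold_db
```

In this sketch, a candidate pseudo label is first shifted by `time_align`, rescaled by `level_align`, and kept for SE training only if `keep_as_pseudo_label` passes.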
Longjie Luo
Xiamen University
speech signal processing
Shenghui Lu
Xiamen University
Speech enhancement, speech recognition
Lin Li
School of Electronic Science and Engineering, Xiamen University, China
Q. Hong
School of Informatics, Xiamen University, China