🤖 AI Summary
This paper addresses the degradation of automatic speech recognition (ASR) performance in far-field multi-speaker meeting scenarios, specifically those defined in the MISP-Meeting Challenge Track 2, caused by strong background noise, reverberation, overlapping speech, and diverse meeting topics. To tackle these challenges, we propose a robust audio-visual speech recognition (AVSR) framework. Our key contributions are: (1) TLS, a pseudo-label generation framework that combines time alignment, level alignment, and signal-to-noise ratio (SNR) filtering to produce signal-level pseudo labels for real-recorded far-field audio; (2) G-SpatialNet, a speech enhancement model that refines Guided Source Separation (GSS) signals; and (3) a unified pipeline combining fine-tuning strategies, data augmentation, and multimodal information to adapt pre-trained ASR models to meeting scenarios. Evaluated on the Dev and Eval sets, our system achieves character error rates (CERs) of 5.44% and 9.52%, respectively, relative improvements of 64.8% and 52.6% over the baseline, and secures second place in the challenge.
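To make the TLS idea concrete, below is a minimal sketch of one align-match-filter step, assuming 16 kHz single-channel NumPy arrays for a far-field recording and a near-field reference. The function names, the residual-based SNR estimate, and the 5 dB threshold are illustrative assumptions for this sketch, not the authors' implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def _shift_into(x, lag, length):
    """Place x into a zero buffer of `length` samples, delayed by `lag` (may be negative)."""
    out = np.zeros(length)
    if lag >= 0:
        n = min(length - lag, len(x))
        if n > 0:
            out[lag:lag + n] = x[:n]
    else:
        n = min(length, len(x) + lag)
        if n > 0:
            out[:n] = x[-lag:-lag + n]
    return out

def tls_pseudo_label(far_field, near_field, snr_threshold_db=5.0):
    """Hypothetical TLS-style step: time-align, level-match, then SNR-filter a signal pair."""
    # Time alignment: pick the lag that maximizes the cross-correlation.
    corr = fftconvolve(far_field, near_field[::-1], mode="full")
    lag = int(np.argmax(corr)) - (len(near_field) - 1)
    aligned = _shift_into(near_field, lag, len(far_field))
    # Level alignment: least-squares gain that best matches the far-field level.
    gain = float(np.dot(far_field, aligned) / (np.dot(aligned, aligned) + 1e-8))
    target = gain * aligned
    # SNR filtering: treat the residual as noise; keep only sufficiently clean pairs.
    residual = far_field - target
    snr_db = 10.0 * np.log10(np.sum(target ** 2) / (np.sum(residual ** 2) + 1e-8))
    return (target if snr_db >= snr_threshold_db else None), snr_db
```

A pair that clears the threshold yields (far_field, target) as a noisy/clean training example for the speech enhancement model; rejected pairs are simply dropped.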
📝 Abstract
This paper presents our system for the MISP-Meeting Challenge Track 2. The primary difficulty lies in the dataset, which contains strong background noise, reverberation, overlapping speech, and diverse meeting topics. To address these issues, we (a) designed G-SpatialNet, a speech enhancement (SE) model to improve Guided Source Separation (GSS) signals; (b) proposed TLS, a framework comprising time alignment, level alignment, and signal-to-noise ratio (SNR) filtering, to generate signal-level pseudo labels for real-recorded far-field audio data, thereby facilitating the training of SE models; and (c) explored fine-tuning strategies, data augmentation, and multimodal information to enhance the performance of pre-trained Automatic Speech Recognition (ASR) models in meeting scenarios. Finally, our system achieved character error rates (CERs) of 5.44% and 9.52% on the Dev and Eval sets, respectively, with relative improvements of 64.8% and 52.6% over the baseline, securing second place.
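Contribution (c) mentions data augmentation for adapting pre-trained ASR models to noisy meeting audio. A common ingredient of such augmentation is on-the-fly noise mixing at a randomly drawn SNR; the sketch below shows this under the assumption of single-channel NumPy waveforms. The helper name and the SNR range are hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np

def mix_at_random_snr(speech, noise, snr_range_db=(0.0, 20.0), rng=None):
    """Illustrative augmentation helper: mix a noise clip into speech at a random SNR."""
    rng = rng or np.random.default_rng()
    # Tile or crop the noise clip to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    snr_db = rng.uniform(*snr_range_db)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-8
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Drawing a fresh SNR per utterance at training time exposes the ASR model to a spread of noise conditions rather than one fixed mixing level, which is the usual motivation for dynamic (on-the-fly) augmentation.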