🤖 AI Summary
This paper addresses the degradation of automatic speech recognition (ASR) performance in far-field multi-speaker meeting scenarios, specifically those defined in the MISP-Meeting Challenge Track 2, caused by strong background noise, reverberation, overlapping speech, and diverse meeting topics. To tackle these challenges, we propose a robust audio-visual speech recognition (AVSR) framework. Our key contributions are: (1) TLS, a pseudo-label generation framework that combines time alignment, level alignment, and signal-to-noise ratio (SNR) filtering to produce signal-level pseudo labels for real-recorded far-field audio; (2) G-SpatialNet, a speech enhancement model that refines Guided Source Separation (GSS) signals; and (3) a unified pipeline combining fine-tuning strategies, data augmentation, and multimodal information to adapt pre-trained ASR models to meeting scenarios. Evaluated on the Dev and Eval sets, our system achieves character error rates (CERs) of 5.44% and 9.52%, respectively, relative improvements of 64.8% and 52.6% over the baseline, and secures second place in the challenge.
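To make the TLS idea concrete, below is a minimal sketch of one align-match-filter step, assuming 16 kHz single-channel NumPy arrays for a far-field recording and a near-field reference. The function names, the residual-based SNR estimate, and the 5 dB threshold are illustrative assumptions for this sketch, not the authors' implementation.

```python
import numpy as np
from scipy.signal import fftconvolve

def _shift_into(x, lag, length):
    """Place x into a zero buffer of `length` samples, delayed by `lag` (may be negative)."""
    out = np.zeros(length)
    if lag >= 0:
        n = min(length - lag, len(x))
        if n > 0:
            out[lag:lag + n] = x[:n]
    else:
        n = min(length, len(x) + lag)
        if n > 0:
            out[:n] = x[-lag:-lag + n]
    return out

def tls_pseudo_label(far_field, near_field, snr_threshold_db=5.0):
    """Hypothetical TLS-style step: time-align, level-match, then SNR-filter a signal pair."""
    # Time alignment: pick the lag that maximizes the cross-correlation.
    corr = fftconvolve(far_field, near_field[::-1], mode="full")
    lag = int(np.argmax(corr)) - (len(near_field) - 1)
    aligned = _shift_into(near_field, lag, len(far_field))
    # Level alignment: least-squares gain that best matches the far-field level.
    gain = float(np.dot(far_field, aligned) / (np.dot(aligned, aligned) + 1e-8))
    target = gain * aligned
    # SNR filtering: treat the residual as noise; keep only sufficiently clean pairs.
    residual = far_field - target
    snr_db = 10.0 * np.log10(np.sum(target ** 2) / (np.sum(residual ** 2) + 1e-8))
    return (target if snr_db >= snr_threshold_db else None), snr_db
```

A pair that clears the threshold yields (far_field, target) as a noisy/clean training example for the speech enhancement model; rejected pairs are simply dropped.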
📝 Abstract
This paper presents our system for the MISP-Meeting Challenge Track 2. The primary difficulty lies in the dataset, which contains strong background noise, reverberation, overlapping speech, and diverse meeting topics. To address these issues, we (a) designed G-SpatialNet, a speech enhancement (SE) model to improve Guided Source Separation (GSS) signals; (b) proposed TLS, a framework comprising time alignment, level alignment, and signal-to-noise ratio (SNR) filtering, to generate signal-level pseudo labels for real-recorded far-field audio data, thereby facilitating the training of SE models; and (c) explored fine-tuning strategies, data augmentation, and multimodal information to enhance the performance of pre-trained Automatic Speech Recognition (ASR) models in meeting scenarios. Finally, our system achieved character error rates (CERs) of 5.44% and 9.52% on the Dev and Eval sets, respectively, with relative improvements of 64.8% and 52.6% over the baseline, securing second place.
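Contribution (c) mentions data augmentation for adapting pre-trained ASR models to noisy meeting audio. A common ingredient of such augmentation is on-the-fly noise mixing at a randomly drawn SNR; the sketch below shows this under the assumption of single-channel NumPy waveforms. The helper name and the SNR range are hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np

def mix_at_random_snr(speech, noise, snr_range_db=(0.0, 20.0), rng=None):
    """Illustrative augmentation helper: mix a noise clip into speech at a random SNR."""
    rng = rng or np.random.default_rng()
    # Tile or crop the noise clip to match the speech length.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    snr_db = rng.uniform(*snr_range_db)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-8
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Drawing a fresh SNR per utterance at training time exposes the ASR model to a spread of noise conditions rather than one fixed mixing level, which is the usual motivation for dynamic (on-the-fly) augmentation.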