🤖 AI Summary
In classroom environments, automatic speech recognition (ASR) performance degrades significantly due to multiple concurrent noise sources (e.g., overlapping student speech, HVAC/projector noise), multi-microphone acquisition, and strong reverberation. To address this, we propose a domain adaptation method for Wav2vec 2.0 based on continued pretraining (CPT), the first systematic validation of CPT on real-world classroom speech. Our approach refines the model's self-supervised representations using only unlabeled classroom recordings, thereby improving robust generalization across noise types, recording devices, and classroom environments. Experiments demonstrate a relative word error rate (WER) reduction exceeding 10%, with particularly substantial gains under low signal-to-noise ratio (SNR), high reverberation, and far-field recording conditions. This work establishes a lightweight paradigm for robust ASR in educational AI applications.
📝 Abstract
Creating Automatic Speech Recognition (ASR) systems that are robust and resilient to classroom conditions is paramount to the development of AI tools that aid teachers and students. In this work, we study the efficacy of continued pretraining (CPT) in adapting Wav2vec 2.0 to the classroom domain. We show that CPT is a powerful tool in this regard, reducing the Word Error Rate (WER) of Wav2vec 2.0-based models by upwards of 10%. More specifically, CPT improves the model's robustness to different noise types, microphones, and classroom conditions.
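The abstract does not spell out the CPT objective, but continued pretraining of Wav2vec 2.0 optimizes the same self-supervised contrastive loss as the original pretraining, now over unlabeled in-domain (classroom) audio: at each masked timestep, the model must identify the true quantized target among distractors. A minimal NumPy sketch of that per-timestep loss is below; the temperature `kappa`, vector dimensions, and the synthetic context/target vectors are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between a context vector and a candidate quantized vector.
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(context, target, distractors, kappa=0.1):
    """Wav2vec 2.0-style contrastive loss for one masked timestep:
    -log softmax of sim(context, true target) over {target} ∪ distractors."""
    candidates = [target] + list(distractors)
    sims = np.array([cosine_sim(context, q) for q in candidates]) / kappa
    # Log-sum-exp for numerical stability.
    log_denom = sims.max() + np.log(np.exp(sims - sims.max()).sum())
    return -(sims[0] - log_denom)  # NLL of the true quantized target

# Toy example: a target aligned with the context yields a low loss.
rng = np.random.default_rng(0)
ctx = rng.normal(size=16)                     # context-network output c_t
tgt = ctx + 0.1 * rng.normal(size=16)         # quantized target q_t (aligned)
negs = [rng.normal(size=16) for _ in range(5)]  # distractors from other timesteps
loss = contrastive_loss(ctx, tgt, negs)
```

During CPT, this loss is minimized on classroom recordings alone (no transcripts), which is what lets the adaptation proceed without labeled data.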