🤖 AI Summary
Automatic speech recognition (ASR) systems often misrecognize acronyms, proper nouns, and domain-specific new words, largely because labeled training data for them is scarce. This paper proposes a self-supervised continual learning framework that uses presentation audio together with the corresponding slides. The slides serve as an unsupervised source of candidate new words: a memory-augmented ASR model is biased towards decoding them, which supports new-word discovery during inference. Utterances containing detected new words are then collected as pseudo-labeled adaptation data and used to incrementally train adaptation weights added to the model, and the procedure is repeated over many talks. The method requires no manual annotation. Recall on new words increases with how often they occur, exceeding 80% for frequent words, while the general ASR performance of the model is preserved.
📝 Abstract
Despite recent advances, Automatic Speech Recognition (ASR) systems are still far from perfect. Typical errors involve acronyms, named entities, and domain-specific terms for which little or no labeled data is available. To address the problem of recognizing these words, we propose a self-supervised continual learning approach. Given the audio of a lecture talk and the corresponding slides, we bias the model towards decoding new words from the slides by using a memory-enhanced ASR model from the literature. We then perform inference on the talk and collect the utterances that contain detected new words into an adaptation data set. Continual learning is performed by training adaptation weights, added to the model, on this data set. The whole procedure is iterated over many talks. We show that with this approach, performance on the new words increases the more frequently they occur (reaching more than 80% recall), while the general performance of the model is preserved.
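The procedure in the abstract (bias towards slide words, decode, collect utterances with detected new words, train, repeat per talk) can be sketched as a simple loop. This is only an illustrative outline, not the authors' implementation: the function names, the `decode` and `train` callables, the `known_vocab` set, the whitespace-based word matching, and the choice to accumulate adaptation data across talks are all assumptions made for this sketch. The memory-enhanced ASR model and the adaptation weights are represented by caller-supplied functions.

```python
import re
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   decode(utterances, bias_words) -> one hypothesis string per utterance
#   train(adaptation_set)          -> updates the adaptation weights
AdaptationItem = Tuple[str, Set[str]]
DecodeFn = Callable[[Sequence[str], Set[str]], List[str]]
TrainFn = Callable[[List[AdaptationItem]], None]


def slide_new_words(slide_text: str, known_vocab: Set[str]) -> Set[str]:
    """Return lower-cased slide tokens that are not in the known vocabulary."""
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9-]*", slide_text)
    return {t.lower() for t in tokens} - known_vocab


def collect_adaptation_data(
    hypotheses: Sequence[str], new_words: Set[str]
) -> List[AdaptationItem]:
    """Keep hypotheses that contain at least one of the new words.

    Matching is a plain whitespace split, so punctuation attached to a
    word would prevent a match (a simplification of this sketch).
    """
    selected: List[AdaptationItem] = []
    for hyp in hypotheses:
        found = {w for w in hyp.lower().split() if w in new_words}
        if found:
            selected.append((hyp, found))
    return selected


def run_continual_adaptation(
    talks: Sequence[Tuple[Sequence[str], str]],
    known_vocab: Set[str],
    decode: DecodeFn,
    train: TrainFn,
) -> List[AdaptationItem]:
    """Iterate over (utterances, slide_text) pairs, one pair per talk."""
    adaptation_set: List[AdaptationItem] = []
    for utterances, slide_text in talks:
        # 1. Candidate new words come from the slides of this talk.
        bias_words = slide_new_words(slide_text, known_vocab)
        # 2. Decode the talk with the model biased towards those words.
        hypotheses = decode(utterances, bias_words)
        # 3. Collect utterances in which a new word was detected.
        adaptation_set.extend(collect_adaptation_data(hypotheses, bias_words))
        # 4. Train on the data collected so far (assumed to accumulate).
        train(adaptation_set)
    return adaptation_set
```

In this sketch the decoded hypothesis doubles as the pseudo-label for its utterance, which is why no manual annotation enters the loop; a real system would store the audio alongside the hypothesis.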