🤖 AI Summary
This study addresses a limitation of pretrained automatic speech recognition (ASR) models: their neglect of prosodic cues such as pitch accent, which hinders low-resource fine-tuning performance. We propose a multitask pretraining framework that jointly optimizes ASR and pitch-accent detection within a shared speech representation space, enabling synchronized modeling of lexical content and prosodic structure. Leveraging semi-supervised speech representations, the framework achieves cooperative optimization across tasks. On LibriSpeech, our method reduces ASR word error rate (WER) by 28.3% and significantly improves pitch-accent detection F1 score, narrowing the gap with state-of-the-art methods by 41%. Crucially, it provides the first empirical validation, under low-resource fine-tuning, that explicit prosodic modeling enhances ASR robustness. The core innovation lies in embedding pitch accent as a structured prior into ASR pretraining, thereby strengthening the model's capacity to perceive and exploit prosodic information.
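The joint optimization described above can be sketched as a weighted sum of the two task losses computed over a shared encoder's representations. This is a minimal illustration, not the paper's implementation; the weighting factor `lam` and the function names are assumptions for the sketch.

```python
# Hypothetical sketch of a multitask objective: a shared speech encoder feeds
# an ASR head and a pitch-accent detection head, and training minimizes a
# weighted combination of the two per-task losses.

def multitask_loss(asr_loss: float, accent_loss: float, lam: float = 0.3) -> float:
    """Combine the ASR loss and the pitch-accent detection loss.

    `lam` balances the auxiliary prosody task against the primary ASR task;
    its value here is an assumed hyperparameter, not taken from the paper.
    """
    return asr_loss + lam * accent_loss

# Example: combine per-batch losses from the two heads.
total = multitask_loss(asr_loss=2.5, accent_loss=0.8, lam=0.3)  # 2.5 + 0.24 = 2.74
```

In practice both losses would be backpropagated through the shared encoder, which is what lets the prosodic signal shape the representations the ASR head consumes.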
📝 Abstract
We show that the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement over the state of the art for the task, closing the gap in F1 score by 41%. Additionally, joint training decreases ASR WER by 28.3% on LibriSpeech under limited-resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.