🤖 AI Summary
This study addresses a limitation of pretrained automatic speech recognition (ASR) models: their neglect of prosodic cues such as pitch accent, which hinders low-resource fine-tuning performance. We propose a multitask pretraining framework that jointly optimizes ASR and pitch-accent detection within a shared speech representation space, enabling synchronized modeling of lexical content and prosodic structure. Leveraging semi-supervised speech representations, the framework achieves cooperative optimization across tasks. On LibriSpeech, our method reduces ASR word error rate (WER) by 28.3% and significantly improves pitch-accent detection F1 score, narrowing the gap with state-of-the-art methods by 41%. Crucially, it provides the first empirical validation, under low-resource fine-tuning, that explicit prosodic modeling enhances ASR robustness. The core innovation lies in embedding pitch accent as a structured prior into ASR pretraining, thereby strengthening the model's capacity to perceive and exploit prosodic information.
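The joint optimization described above can be sketched as a weighted sum of the two task losses computed over a shared encoder's representations. This is a minimal illustration, not the paper's implementation; the weighting factor `lam` and the function names are assumptions for the sketch.

```python
# Hypothetical sketch of a multitask objective: a shared speech encoder feeds
# an ASR head and a pitch-accent detection head, and training minimizes a
# weighted combination of the two per-task losses.

def multitask_loss(asr_loss: float, accent_loss: float, lam: float = 0.3) -> float:
    """Combine the ASR loss and the pitch-accent detection loss.

    `lam` balances the auxiliary prosody task against the primary ASR task;
    its value here is an assumed hyperparameter, not taken from the paper.
    """
    return asr_loss + lam * accent_loss

# Example: combine per-batch losses from the two heads.
total = multitask_loss(asr_loss=2.5, accent_loss=0.8, lam=0.3)  # 2.5 + 0.24 = 2.74
```

In practice both losses would be backpropagated through the shared encoder, which is what lets the prosodic signal shape the representations the ASR head consumes.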
📝 Abstract
We show that the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement over the state of the art for the task, closing the gap in F1 score by 41%. Additionally, joint training decreases ASR WER by 28.3% on LibriSpeech under limited-resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.