🤖 AI Summary
Existing evaluations of self-supervised speech models lack direct measures sensitive to prosodic contrasts such as lexical stress, pitch accent, and tone. This work proposes Prosodic ABX, an extension of the classic ABX discriminability task, which enables cross-lingual assessment of a model's ability to distinguish prosodic differences using only a small amount of unlabeled data. Requiring no explicit prosodic annotations, the method constitutes the first language-agnostic, low-resource approach to evaluating prosodic contrast sensitivity. The authors also release a novel dataset of English–Japanese minimal prosodic pairs. Experiments on English stress, Japanese pitch accent, and Mandarin tone tasks demonstrate the method's effectiveness, revealing consistent performance rankings across models and network layers under varying conditions and confirming its suitability for low-resource settings.
📝 Abstract
Speech representations from self-supervised speech models (S3Ms) are known to be sensitive to phonemic contrasts, but their sensitivity to prosodic contrasts has not been directly measured. The ABX discrimination task has been used to measure phonemic contrast in S3M representations via minimal pairs. We introduce prosodic ABX, an extension of this framework that evaluates prosodic contrast with only a handful of examples and no explicit labels. We also build and release a dataset of English and Japanese minimal pairs and use it, along with a Mandarin dataset, to evaluate contrast in English stress, Japanese pitch accent, and Mandarin tone. Finally, we show that model and layer rankings are often preserved across several experimental conditions, making prosodic ABX practical for low-resource settings.
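For intuition, the ABX framework underlying the abstract can be sketched as follows: given a triplet where A and X belong to the same category (e.g. the same stress pattern) and B to a contrasting one, the representation discriminates the contrast if X lies closer to A than to B. The helper names, the mean-pooling step, and the cosine distance below are illustrative assumptions, not the paper's exact pipeline (which operates on S3M frame-level features):

```python
# Minimal sketch of an ABX discriminability score.
# Assumptions (not from the paper): utterances are lists of
# equal-length feature vectors, pooled by mean and compared
# with cosine distance.
import math

def cosine_distance(u, v):
    # 1 - cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def mean_pool(frames):
    # Collapse a variable-length list of frame vectors into one vector.
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]

def abx_accuracy(triplets):
    # Each triplet (A, B, X): A and X share a category, B differs.
    # A triplet counts as correct if X is closer to A than to B.
    correct = sum(
        1
        for A, B, X in triplets
        if cosine_distance(mean_pool(A), mean_pool(X))
        < cosine_distance(mean_pool(B), mean_pool(X))
    )
    return correct / len(triplets)
```

A representation scoring near 1.0 separates the contrast well; chance level is 0.5. The actual prosodic ABX task applies this idea to minimal prosodic pairs rather than phonemic ones.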