🤖 AI Summary
Cross-lingual speech modeling for low-resource languages remains challenging due to the lack of principled criteria for selecting optimal source languages. Method: The paper introduces a phonology-distance-based framework for quantifying cross-lingual phonetic similarity, guiding both source-language selection and multilingual joint training. It integrates phonological modeling, phoneme-level recognition evaluation, and genealogical structure analysis. Contribution/Results: The authors demonstrate a strong positive correlation between intra-family phonological similarity and model performance; notably, cross-family language pairs with high phonological proximity can outperform a large self-supervised model (e.g., Wav2Vec 2.0). On phoneme recognition, the approach achieves a 55.6% relative improvement over monolingual baselines. High-similarity language combinations yield substantial gains, whereas low-similarity combinations degrade performance, validating phonological distance as an effective, generalizable criterion for cross-lingual transfer in speech modeling.
📝 Abstract
This paper examines how linguistic similarity affects cross-lingual phonetic representation in speech processing for low-resource languages, with an emphasis on effective source-language selection. Previous cross-lingual research has drawn on various source languages to improve performance on a target low-resource language, but without systematic consideration of how those sources are chosen. Our study stands out by providing an in-depth analysis of language selection, supported by a practical approach for assessing phonetic proximity across multiple language families. We investigate how within-family similarity affects multilingual training performance, shedding light on cross-lingual transfer dynamics. We also evaluate the effect of training on phonologically similar languages regardless of family. On the phoneme recognition task, using phonologically similar languages consistently achieves a relative improvement of 55.6% over monolingual training, even surpassing the performance of a large-scale self-supervised learning model. Multilingual training within the same language family shows that higher phonological similarity enhances performance, while lower similarity degrades it relative to monolingual training.
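The abstract does not spell out how phonological proximity is computed. As a minimal illustrative sketch (not the paper's actual method), one simple proxy is set overlap between phoneme inventories: languages sharing more phonemes score closer, which can then rank candidate source languages for a target. The inventories below are toy subsets, hypothetical and for illustration only.

```python
def phonological_similarity(inv_a, inv_b):
    """Jaccard similarity between two phoneme inventories, in [0.0, 1.0].

    A crude proxy for phonological distance: 1.0 means identical
    inventories, 0.0 means no shared phonemes. Real frameworks would
    weight phonemes by articulatory features rather than exact identity.
    """
    a, b = set(inv_a), set(inv_b)
    return len(a & b) / len(a | b)


# Toy, incomplete inventories (hypothetical; real inventories are larger).
spanish = {"p", "t", "k", "b", "d", "g", "m", "n", "s", "l", "r",
           "a", "e", "i", "o", "u"}
italian = spanish | {"ts"}                       # near-identical inventory
english = spanish | {"θ", "ð", "ʃ", "dʒ"}        # adds several phonemes

# Rank candidate source languages for a target by descending similarity.
candidates = {"italian": italian, "english": english}
ranking = sorted(candidates,
                 key=lambda name: phonological_similarity(spanish, candidates[name]),
                 reverse=True)
```

Under this toy metric, Italian ranks above English as a source for Spanish, mirroring the paper's finding that phonologically closer sources transfer better.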