🤖 AI Summary
This study quantifies cross-lingual intelligibility among five Romance languages (French, Italian, Portuguese, Spanish, and Romanian), with particular attention to its asymmetry. It proposes a computational metric that combines the surface similarity of related words, measured on both orthographic and phonetic forms, with their semantic similarity as estimated by word embedding models. Scores are computed using different parallel corpora and embedding models, and they correlate significantly with human performance on cloze tests, confirming the expected asymmetry of mutual intelligibility across these languages. The approach offers a data-driven way to model cross-lingual comprehension, with potential applications in multilingual NLP.
📝 Abstract
We present an analysis of mutual intelligibility in related languages, applied to the Romance family. We introduce a novel computational metric that estimates intelligibility from lexical similarity, combining the surface and semantic similarity of related words. We use it to measure mutual intelligibility among the five main Romance languages (French, Italian, Portuguese, Spanish, and Romanian), comparing results across orthographic and phonetic word forms, different parallel corpora, and different vector-space models of word meaning. The resulting intelligibility scores confirm intuitions about intelligibility asymmetry across languages and correlate significantly with the results of cloze tests in human experiments.
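To make the idea of combining surface and semantic similarity concrete, here is a minimal sketch in Python. It is not the paper's metric, whose exact formula the abstract does not give. It assumes that surface similarity is a normalized edit distance, that semantic similarity is the cosine between word vectors, that the two are mixed with a hypothetical weight `alpha`, and that each word pair is weighted by the word's frequency in the language being read. The word pairs, vectors, and frequencies below are invented toy values, not data from the study.

```python
from math import sqrt


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def surface_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for maximally different ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def cosine(u, v) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return 0.0 if norm == 0 else dot / norm


def intelligibility(pairs, alpha: float = 0.5) -> float:
    """Frequency-weighted mean similarity over related word pairs.

    Each pair is (word_read, word_known, vec_read, vec_known, freq_read),
    where freq_read is the word's frequency in the language being read.
    alpha is an assumed mixing weight between surface and semantic similarity.
    """
    total = sum(freq for *_, freq in pairs)
    score = 0.0
    for word_read, word_known, vec_read, vec_known, freq in pairs:
        surface = surface_similarity(word_read, word_known)
        semantic = max(0.0, cosine(vec_read, vec_known))  # clamp negatives
        score += freq * (alpha * surface + (1 - alpha) * semantic)
    return score / total


# Toy data (invented): Spanish text read by a Portuguese speaker ...
es_read_by_pt = intelligibility([
    ("noche", "noite", [1.0, 0.0], [1.0, 0.0], 3.0),
    ("ventana", "janela", [1.0, 0.0], [0.8, 0.6], 1.0),
])
# ... and Portuguese text read by a Spanish speaker, with different frequencies.
pt_read_by_es = intelligibility([
    ("noite", "noche", [1.0, 0.0], [1.0, 0.0], 1.0),
    ("janela", "ventana", [0.8, 0.6], [1.0, 0.0], 3.0),
])
print(es_read_by_pt, pt_read_by_es)
```

In this sketch the per-pair similarities are symmetric, so any asymmetry between the two directions comes only from the frequency weights of the language being read. This is one simple way an asymmetric score can arise, and the study's own source of asymmetry may differ. Replacing orthographic strings with phonetic transcriptions would give the phonetic variant of the surface term.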