🤖 AI Summary
This study presents the first systematic evaluation of pretrained language models’ (PLMs) ability to identify loanwords across ten typologically diverse languages, probing whether such models can reliably distinguish loanwords from native words. We design a multilingual loanword identification task and conduct controlled experiments on both mainstream and low-resource PLMs, employing explicit instructions and contextual prompting. Results reveal a pervasive bias: models consistently over-predict loanword status—particularly in minority languages heavily influenced by dominant languages like English—exposing a systemic failure in native word recognition. This work uncovers a fundamental limitation of PLMs in modeling language contact phenomena and provides empirical evidence that current models inadequately capture linguistic diversity and cultural specificity. Beyond diagnosing this gap, our study introduces a novel, multilingual evaluation framework for assessing lexical nativeness—a critical yet overlooked dimension in NLP robustness and fairness research.
📝 Abstract
Throughout language history, words are borrowed from one language to another and gradually become integrated into the recipient's lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess similar capabilities for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly in distinguishing loanwords from native ones. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords rather than native equivalents. Our work has implications for developing NLP tools for minority languages and supporting language preservation in communities under lexical pressure from dominant languages.