Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages

📅 2025-10-30
🤖 AI Summary
This study presents the first systematic evaluation of pretrained language models’ (PLMs) ability to identify loanwords across ten typologically diverse languages, probing whether such models can reliably distinguish loanwords from native words. We design a multilingual loanword identification task and conduct controlled experiments on both mainstream and low-resource PLMs, employing explicit instructions and contextual prompting. Results reveal a pervasive bias: models consistently over-predict loanword status—particularly in minority languages heavily influenced by dominant languages like English—exposing a systemic failure in native word recognition. This work uncovers a fundamental limitation of PLMs in modeling language contact phenomena and provides empirical evidence that current models inadequately capture linguistic diversity and cultural specificity. Beyond diagnosing this gap, our study introduces a novel, multilingual evaluation framework for assessing lexical nativeness—a critical yet overlooked dimension in NLP robustness and fairness research.

📝 Abstract
Throughout language history, words are borrowed from one language into another and gradually become integrated into the recipient language's lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess a similar capability for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly in distinguishing loanwords from native words. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords rather than native equivalents. Our work has implications for developing NLP tools for minority languages and for supporting language preservation in communities under lexical pressure from dominant languages.
Problem

Research questions and friction points this paper is trying to address.

Evaluating language models' ability to identify loanwords across multiple languages
Investigating whether pretrained models can distinguish borrowed from native vocabulary
Addressing NLP system bias toward loanwords in minority language preservation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated loanword identification across ten languages
Tested pretrained models including large language models
Found models biased toward loanwords over native words
Mérilin Sousa Silva
Department of Computational Linguistics, University of Zurich
Sina Ahmadi
University of Zurich
Natural Language Processing · Computational Linguistics