Language Models Are Borrowing-Blind: A Multilingual Evaluation of Loanword Identification across 10 Languages

📅 2025-10-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study presents the first systematic evaluation of how well pretrained language models (PLMs) identify loanwords across ten typologically diverse languages, probing whether such models can reliably distinguish loanwords from native words. The authors design a multilingual loanword identification task and run controlled experiments on both mainstream and low-resource PLMs, using explicit instructions and contextual prompting. The results reveal a pervasive bias: models consistently over-predict loanword status, particularly in minority languages heavily influenced by dominant languages such as English, which exposes a systematic failure to recognize native words. The work points to a fundamental limitation of PLMs in modeling language-contact phenomena and offers empirical evidence that current models inadequately capture linguistic diversity and cultural specificity. Beyond diagnosing this gap, the study introduces a multilingual evaluation framework for assessing lexical nativeness, a critical yet overlooked dimension of NLP robustness and fairness research.
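To make the task setup concrete, the sketch below shows one way a loanword-identification query with an explicit instruction and optional sentence context could be posed to a model and its reply parsed. The prompt wording and the names `build_prompt` and `parse_answer` are assumptions made for illustration; the summary does not give the paper's actual prompts or answer format.

```python
from typing import Optional


def build_prompt(word: str, language: str, context: Optional[str] = None) -> str:
    """Build a yes/no loanword query (hypothetical wording, not the paper's)."""
    lines = [
        # Explicit instruction: define the concept before asking.
        f"A loanword is a word adopted into {language} from another language.",
        f'Is the {language} word "{word}" a loanword? Answer "yes" or "no".',
    ]
    if context is not None:
        # Contextual prompting: show the word in a sentence before the question.
        lines.insert(1, f"Sentence: {context}")
    return "\n".join(lines)


def parse_answer(reply: str) -> Optional[bool]:
    """Map a free-text reply to True (loanword), False (native), or None."""
    text = reply.strip().lower()
    if text.startswith("yes"):
        return True
    if text.startswith("no"):
        return False
    return None  # reply did not follow the requested format
```

Returning `None` for off-format replies keeps unparseable answers separate from "native" predictions, so they are not silently counted toward either class.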

📝 Abstract

Throughout language history, words are borrowed from one language to another and gradually become integrated into the recipient's lexicon. Speakers can often differentiate these loanwords from native vocabulary, particularly in bilingual communities where a dominant language continuously imposes lexical items on a minority language. This paper investigates whether pretrained language models, including large language models, possess similar capabilities for loanword identification. We evaluate multiple models across 10 languages. Despite explicit instructions and contextual information, our results show that models perform poorly in distinguishing loanwords from native ones. These findings corroborate previous evidence that modern NLP systems exhibit a bias toward loanwords rather than native equivalents. Our work has implications for developing NLP tools for minority languages and supporting language preservation in communities under lexical pressure from dominant languages.

Problem

Research questions and friction points this paper is trying to address.

Evaluating language models' ability to identify loanwords across multiple languages

Investigating whether pretrained models can distinguish borrowed from native vocabulary

Addressing NLP systems' bias toward loanwords and its implications for minority-language preservation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluated loanword identification across ten languages

Tested pretrained models including large language models

Found models biased toward loanwords over native words
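A bias toward loanwords can be made measurable by comparing per-class recall and the rate of "loanword" predictions against the true rate. The following is a minimal sketch under stated assumptions: binary gold labels and predictions (`True` means loanword), with `BiasReport` and `bias_report` as hypothetical names. It is not the paper's evaluation code, and the summary does not say which metrics the authors report.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class BiasReport:
    accuracy: float
    loanword_recall: float      # share of true loanwords labeled as loanwords
    native_recall: float        # share of true native words labeled as native
    predicted_loan_rate: float  # how often the model says "loanword"
    true_loan_rate: float       # how often "loanword" is actually correct


def _ratio(num: int, den: int) -> float:
    # NaN marks an undefined ratio (e.g. no native words in the sample).
    return num / den if den else math.nan


def bias_report(gold: Sequence[bool], pred: Sequence[bool]) -> BiasReport:
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    n = len(gold)
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)
    n_loan = sum(1 for g in gold if g)
    return BiasReport(
        accuracy=(tp + tn) / n,
        loanword_recall=_ratio(tp, n_loan),
        native_recall=_ratio(tn, n - n_loan),
        predicted_loan_rate=sum(1 for p in pred if p) / n,
        true_loan_rate=n_loan / n,
    )
```

Under this framing, a model that over-predicts loanwords shows a `predicted_loan_rate` above the `true_loan_rate` together with a `native_recall` well below its `loanword_recall`.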

🔎 Similar Papers

Exploring Multilingual Probing in Large Language Models: A Cross-Language Analysis