A Typologically Grounded Evaluation Framework for Word Order and Morphology Sensitivity in Multilingual Masked LMs

📅 2026-02-27

🤖 AI Summary

This study addresses the lack of typological grounding in existing methods for evaluating how much multilingual masked language models rely on word order and morphology. It proposes a diagnostic framework that combines linguistic typology with systematic inference-time perturbations of Universal Dependencies corpora: sentence-wide shuffling, content-word reordering with function words held fixed, head-dependent swapping, and sentence-level lemma substitution (+L), which lemmatizes both the context and the masked target label. The framework measures word reconstruction accuracy for mBERT and XLM-R on English, Chinese, German, Spanish, and Russian. Full-sentence shuffling drives accuracy to near zero in every language, while +L has little effect in Chinese but substantially lowers accuracy in German, Spanish, and Russian, and does not mitigate the impact of scrambling. These results point to cross-linguistic differences in the models' dependence on structural and morphological cues.
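To make the perturbations named above concrete, here is a minimal Python sketch of the four operations as the abstract describes them. It is not the authors' released code: the `Token` class, the function names, and the choice of which UPOS tags count as content words (`CONTENT_UPOS`) are all assumptions made for illustration, and the paper's actual sampling and swap-selection rules may differ.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical container for the CoNLL-U fields this sketch needs."""
    form: str   # surface form
    lemma: str  # lemma
    upos: str   # universal POS tag
    head: int   # 1-based index of the syntactic head, 0 for the root


# Assumption: these UPOS tags are treated as content words; the rest stay fixed.
CONTENT_UPOS = {"NOUN", "PROPN", "VERB", "ADJ", "ADV"}


def full_scramble(tokens, rng):
    """Shuffle every token position in the sentence."""
    out = list(tokens)
    rng.shuffle(out)
    return out


def content_scramble(tokens, rng):
    """Shuffle content words among their own slots; function words keep position."""
    slots = [i for i, t in enumerate(tokens) if t.upos in CONTENT_UPOS]
    shuffled = [tokens[i] for i in slots]
    rng.shuffle(shuffled)
    out = list(tokens)
    for slot, tok in zip(slots, shuffled):
        out[slot] = tok
    return out


def head_dependent_swap(tokens, dep_index):
    """Swap the token at dep_index (0-based) with its head.

    The root has no head, so it is returned unchanged. Head indices stored
    on the tokens are not remapped; only the surface order changes.
    """
    head = tokens[dep_index].head
    out = list(tokens)
    if head == 0:
        return out
    head_index = head - 1
    out[dep_index], out[head_index] = out[head_index], out[dep_index]
    return out


def surface(tokens, use_lemma=False):
    """Render tokens as strings; use_lemma=True mimics the +L condition."""
    return [t.lemma if use_lemma else t.form for t in tokens]
```

Because every perturbation returns tokens rather than strings, +L can be combined with any reordering by calling `surface(perturbed, use_lemma=True)`, which matches the abstract's comparison of scrambling with and without lemma substitution.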

📝 Abstract

We introduce a typology-aware diagnostic for multilingual masked language models that tests reliance on word order versus inflectional form. Using Universal Dependencies, we apply inference-time perturbations: full token scrambling, content-word scrambling with function words fixed, dependency-based head-dependent swaps, and sentence-level lemma substitution (+L), which lemmatizes both the context and the masked target label. We evaluate mBERT and XLM-R on English, Chinese, German, Spanish, and Russian. Full scrambling drives word-level reconstruction accuracy near zero in all languages; partial and head-dependent perturbations cause smaller but still large drops. +L has little effect in Chinese but substantially lowers accuracy in German, Spanish, and Russian, and it does not mitigate the impact of scrambling. Top-5 word accuracy shows the same pattern: under full scrambling, the gold word rarely appears among the five highest-ranked reconstructions. We release code, sampling scripts, and balanced evaluation subsets; Turkish results under strict reconstruction are reported in the appendix.
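The top-5 word accuracy mentioned in the abstract can be illustrated with a short Python sketch. This is an assumed formulation, not the paper's evaluation script: the function name `top_k_accuracy` and its input layout (one ranked list of candidate words per masked position) are hypothetical, and the abstract does not spell out how multi-subword reconstructions are ranked.

```python
def top_k_accuracy(ranked_candidates, gold_words, k=5):
    """Fraction of masked positions whose gold word is among the top-k candidates.

    ranked_candidates: one list per masked position, best candidate first.
    gold_words: the gold word for each position (the lemma under +L, since
                that condition lemmatizes the target label as well).
    """
    if len(ranked_candidates) != len(gold_words):
        raise ValueError("need exactly one candidate list per gold word")
    if not gold_words:
        return 0.0
    hits = sum(
        gold in candidates[:k]
        for candidates, gold in zip(ranked_candidates, gold_words)
    )
    return hits / len(gold_words)
```

With `k=1` the same function gives exact-match reconstruction accuracy, so both figures can be computed from one set of ranked model outputs.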

Problem

Research questions and friction points this paper is trying to address.

word order

morphology

multilingual masked language models

typology

inflectional form

Innovation

Methods, ideas, or system contributions that make the work stand out.

typology-aware evaluation

word order sensitivity

morphological sensitivity