🤖 AI Summary
This study investigates the efficacy of tokenizers in preserving morphologically plausible subword boundaries—i.e., the alignment between tokenizer-induced segmentation points and linguistically motivated morpheme boundaries—in multilingual settings. We extend the MorphScore evaluation framework to 70 languages, enabling the first large-scale, cross-lingual measurement of morphological alignment. We systematically analyze its correlation with downstream task performance across seven diverse NLP benchmarks. Experiments span multiple mainstream pretrained language models. Results reveal only weak overall correlation between morphological alignment and model performance, suggesting that while morphological plausibility is linguistically meaningful, it captures only a limited aspect of subword segmentation quality critical for model effectiveness. Our work overcomes prior limitations of MorphScore in language coverage and methodological scalability, establishing a more generalizable and extensible linguistic perspective for tokenizer evaluation.
📝 Abstract
While tokenization is a key step in language modeling, with effects on model training and performance, it remains unclear how to evaluate tokenizer quality effectively. One proposed dimension of tokenizer quality is the extent to which tokenizers preserve linguistically meaningful subwords, aligning token boundaries with morphological boundaries within a word. We expand MorphScore (Arnett & Bergen, 2025), which previously covered 22 languages, to support a total of 70 languages. The updated MorphScore offers more flexibility in evaluation and addresses some of the limitations of the original version. We then correlate our alignment scores with downstream task performance for five pre-trained language models on seven tasks, with at least one task in each of the languages in our sample. We find that morphological alignment explains little of the variance in model performance, suggesting that morphological alignment alone does not capture the dimensions of tokenization quality that matter for model performance.
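To make the idea of morphological alignment concrete, here is a minimal sketch of a MorphScore-style measure: the fraction of gold morpheme boundaries within a word that the tokenizer's segmentation also places a boundary at. This is an illustrative assumption, not the paper's exact implementation; the function names (`boundary_positions`, `morph_alignment`) and the scoring convention for monomorphemic words are hypothetical.

```python
# Hypothetical sketch of a boundary-alignment score in the spirit of
# MorphScore (assumption: not the authors' exact implementation).

def boundary_positions(segments):
    """Character offsets of the internal boundaries in a segmentation."""
    positions, offset = set(), 0
    for seg in segments[:-1]:  # the word-final position is not a boundary
        offset += len(seg)
        positions.add(offset)
    return positions

def morph_alignment(token_segments, morpheme_segments):
    """Fraction of gold morpheme boundaries matched by token boundaries."""
    gold = boundary_positions(morpheme_segments)
    if not gold:  # monomorphemic word: no internal boundaries to recover
        return 1.0
    pred = boundary_positions(token_segments)
    return len(gold & pred) / len(gold)

# "unhappiness" tokenized as un + happiness vs. morphemes un + happi + ness:
# the tokenizer recovers one of the two morpheme boundaries.
print(morph_alignment(["un", "happiness"], ["un", "happi", "ness"]))  # 0.5
```

A corpus-level score would average this quantity over a sample of morphologically analyzed words per language; correlating those averages with benchmark accuracy is then an ordinary per-language correlation analysis.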