🤖 AI Summary
This study investigates why some languages, such as Czech, permit flexible word order while others, like English, impose rigid constraints. By pretraining Transformer-based language models on a range of synthetically generated word-order variants of natural languages, the authors systematically examine how word-order irregularity affects learnability. Their findings challenge the traditional binary distinction between "free" and "fixed" word order, revealing instead that the structure of the word and subword vocabulary is a key predictor of how difficult word-order patterns are to acquire. Experiments demonstrate that increased word-order irregularity substantially raises model surprisal (and hence perplexity, its exponentiated per-token average), whereas simple sentence reversal has minimal impact, underscoring the central role of vocabulary structure in shaping how language models acquire syntactic regularities.
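To make the manipulation concrete, below is a minimal sketch of two word-order perturbations of the kind the summary describes: deterministic sentence reversal, which preserves a consistent ordering rule, and per-sentence random shuffling, which destroys any such rule. The function names are hypothetical, and the paper's actual variants may be constructed differently (e.g., over syntactic constituents rather than raw tokens).

```python
import random

def reverse_sentence(tokens):
    # Deterministic perturbation: reverse the token order.
    # The mapping is a fixed bijection, so a consistent
    # ordering rule still exists for a model to learn.
    return list(reversed(tokens))

def shuffle_sentence(tokens, rng):
    # Irregular perturbation: apply a fresh random permutation
    # per sentence, so no consistent ordering rule survives.
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled

if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    rng = random.Random(0)  # seeded for reproducibility
    print(reverse_sentence(sentence))       # ['mat', 'the', 'on', 'sat', 'cat', 'the']
    print(shuffle_sentence(sentence, rng))  # a random reordering
```

Under this framing, the reversal variant should remain about as learnable as the original language, while the shuffled variant should be markedly harder, which is the pattern the experiments report.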
📝 Abstract
Why do some languages, like Czech, permit free word order, while others, like English, do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction between free-word-order languages (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain the cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
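For readers unfamiliar with the surprisal metric, here is a minimal sketch of how mean per-token surprisal, and the corresponding perplexity, can be computed with an off-the-shelf causal language model through the Hugging Face `transformers` API. The `gpt2` checkpoint is a placeholder for illustration, not one of the models pretrained in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_surprisal(text, model, tokenizer):
    # Tokenize the text and score it with a causal language model.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == inputs, the returned loss is the mean
        # negative log-likelihood (surprisal, in nats) per
        # predicted token under the model.
        loss = model(input_ids, labels=input_ids).loss
    return loss.item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

s = mean_surprisal("The cat sat on the mat.", model, tokenizer)
# Perplexity is the exponentiated mean surprisal.
print(f"mean surprisal: {s:.3f} nats, perplexity: {torch.exp(torch.tensor(s)).item():.2f}")
```

Higher mean surprisal on held-out text from a word-order variant indicates that the variant was harder for the model to learn, which is the sense in which the abstract uses surprisal as a learnability measure.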