🤖 AI Summary
This study investigates why some languages, such as Czech, permit flexible word order while others, like English, impose rigid constraints. By pretraining Transformer-based language models on a range of synthetically generated word-order variants of natural languages, the authors systematically examine how word-order irregularity affects learnability. Their findings challenge the traditional binary distinction between "free" and "fixed" word order, revealing instead that the structure of the word and subword vocabulary is a key predictor of how difficult word-order patterns are to acquire. Experiments demonstrate that increased word-order irregularity substantially raises model surprisal (and hence perplexity, its exponentiated per-token average), whereas simple sentence reversal has minimal impact, underscoring the central role of vocabulary structure in shaping how language models acquire syntactic regularities.
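To make the manipulation concrete, below is a minimal sketch of two word-order perturbations of the kind the summary describes: deterministic sentence reversal, which preserves a consistent ordering rule, and per-sentence random shuffling, which destroys any such rule. The function names are hypothetical, and the paper's actual variants may be constructed differently (e.g., over syntactic constituents rather than raw tokens).

```python
import random

def reverse_sentence(tokens):
    # Deterministic perturbation: reverse the token order.
    # The mapping is a fixed bijection, so a consistent
    # ordering rule still exists for a model to learn.
    return list(reversed(tokens))

def shuffle_sentence(tokens, rng):
    # Irregular perturbation: apply a fresh random permutation
    # per sentence, so no consistent ordering rule survives.
    shuffled = list(tokens)
    rng.shuffle(shuffled)
    return shuffled

if __name__ == "__main__":
    sentence = "the cat sat on the mat".split()
    rng = random.Random(0)  # seeded for reproducibility
    print(reverse_sentence(sentence))       # ['mat', 'the', 'on', 'sat', 'cat', 'the']
    print(shuffle_sentence(sentence, rng))  # a random reordering
```

Under this framing, the reversal variant should remain about as learnable as the original language, while the shuffled variant should be markedly harder, which is the pattern the experiments report.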
📝 Abstract
Why do some languages, like Czech, permit free word order, while others, like English, do not? We address this question by pretraining transformer language models on a spectrum of synthetic word-order variants of natural languages. We observe that greater word-order irregularity consistently raises model surprisal, indicating reduced learnability. Sentence reversal, however, affects learnability only weakly. A coarse distinction between free-word-order languages (e.g., Czech and Finnish) and fixed-word-order languages (e.g., English and French) does not explain the cross-lingual variation. Instead, the structure of the word and subword vocabulary strongly predicts model surprisal. Overall, vocabulary structure emerges as a key driver of computational word-order learnability across languages.
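For readers unfamiliar with the surprisal metric, here is a minimal sketch of how mean per-token surprisal, and the corresponding perplexity, can be computed with an off-the-shelf causal language model through the Hugging Face `transformers` API. The `gpt2` checkpoint is a placeholder for illustration, not one of the models pretrained in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_surprisal(text, model, tokenizer):
    # Tokenize the text and score it with a causal language model.
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels == inputs, the returned loss is the mean
        # negative log-likelihood (surprisal, in nats) per
        # predicted token under the model.
        loss = model(input_ids, labels=input_ids).loss
    return loss.item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

s = mean_surprisal("The cat sat on the mat.", model, tokenizer)
# Perplexity is the exponentiated mean surprisal.
print(f"mean surprisal: {s:.3f} nats, perplexity: {torch.exp(torch.tensor(s)).item():.2f}")
```

Higher mean surprisal on held-out text from a word-order variant indicates that the variant was harder for the model to learn, which is the sense in which the abstract uses surprisal as a learnability measure.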