🤖 AI Summary
This work investigates the morphological compositional generalization capabilities of large language models (LLMs) in agglutinative languages, specifically Turkish and Finnish, with emphasis on their ability to handle novel roots combined with multi-layered affixation. Method: Addressing a gap in prior research, the authors define morphemes as compositional primitives and introduce the first benchmark explicitly designed to evaluate morphological productivity and systematicity. The benchmark comprises generative and discriminative tasks, using manually constructed morphological perturbations and controlled-variable experiments to assess multilingual LLMs including GPT-4 and Gemini. Contribution/Results: LLM accuracy on novel roots degrades sharply as morphological complexity increases: models marginally exceed random baselines but fall substantially short of human-level systematicity. The study exposes a fundamental limitation in LLMs' generalization over underlying linguistic structure and establishes a new paradigm for evaluating compositional competence in language models.
📝 Abstract
Large language models (LLMs) have demonstrated significant progress across a range of natural language generation and understanding tasks. However, their linguistic generalization capabilities remain questionable, raising doubts about whether these models learn language as humans do. While humans exhibit compositional generalization and linguistic creativity in language use, the extent to which LLMs replicate these abilities, particularly in morphology, is under-explored. In this work, we systematically investigate the morphological generalization abilities of LLMs through the lens of compositionality. We define morphemes as compositional primitives and design a novel suite of generative and discriminative tasks to assess morphological productivity and systematicity. Focusing on agglutinative languages such as Turkish and Finnish, we evaluate several state-of-the-art instruction-finetuned multilingual models, including GPT-4 and Gemini. Our analysis shows that LLMs struggle with morphological compositional generalization, particularly when applied to novel word roots, with performance declining sharply as morphological complexity increases. While models can identify individual morphological combinations better than chance, their performance lacks systematicity, leading to significant accuracy gaps relative to humans.
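The kind of rule-governed suffix stacking the benchmark targets can be illustrated with a minimal sketch. The rules below (two-way vowel harmony for the Turkish plural and locative, plus consonant devoicing) are deliberately simplified, and the novel root `zet` is a hypothetical example invented here, not an item from the paper's benchmark:

```python
# Minimal sketch: composing Turkish suffixes onto real and novel roots.
# Simplified rules for illustration; not the paper's actual task design.
FRONT = set("eiöü")       # front vowels select -ler / -de
BACK = set("aıou")        # back vowels select -lar / -da
VOICELESS = set("fstkçşhp")  # voiceless finals devoice -de to -te

def last_vowel(word: str) -> str:
    """Return the rightmost vowel, which governs vowel harmony."""
    return next(c for c in reversed(word) if c in FRONT | BACK)

def plural(word: str) -> str:
    """Attach the plural suffix (-ler/-lar) by two-way vowel harmony."""
    return word + ("ler" if last_vowel(word) in FRONT else "lar")

def locative(word: str) -> str:
    """Attach the locative suffix (-de/-da), devoiced to -te/-ta."""
    suffix = "de" if last_vowel(word) in FRONT else "da"
    if word[-1] in VOICELESS:
        suffix = "t" + suffix[1:]
    return word + suffix

# Real root: ev ("house") -> ev-ler-de, "in the houses"
assert locative(plural("ev")) == "evlerde"
# Hypothetical novel root "zet": the same rules compose systematically,
# which is the human-like generalization the paper probes LLMs for.
assert locative(plural("zet")) == "zetlerde"
```

A human speaker applies these rules productively to a root they have never seen; the paper's tasks test whether LLMs do the same as the suffix chain grows longer.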