🤖 AI Summary
This study investigates how Transformer models represent the complex verbal morphology of Turkish, an agglutinative language, and Hebrew, a non-concatenative fusional language, with a focus on the impact of tokenization strategies. Using the Blackbird Language Matrices probing task, the authors evaluate monolingual and multilingual Transformer models on both natural and synthetic data, assessing their ability to model verb paradigms under atomic, subword, and character-level tokenization. Results show that models perform robustly on Turkish across all tokenization schemes, whereas on Hebrew significant gains arise only from monolingual models combined with morphology-aware tokenization. Synthetic data further enhances performance across all settings. These findings highlight a critical interaction between tokenization granularity and morphological typology, underscoring the importance of morphology-sensitive tokenization for non-concatenative languages.
📝 Abstract
We investigate how transformer models represent complex verb paradigms in Turkish and Modern Hebrew, concentrating on how tokenization strategies shape this ability. Using the Blackbird Language Matrices task on natural data, we show that for Turkish -- with its transparent morphological markers -- both monolingual and multilingual models succeed, whether tokenization is atomic or breaks words into small subword units. For Hebrew, in contrast, monolingual and multilingual models diverge. A multilingual model using character-level tokenization fails to capture the language's non-concatenative morphology, but a monolingual model with morpheme-aware segmentation performs well. Performance improves on more synthetic datasets for all models.