🤖 AI Summary
This work investigates how subword segmentation evolves during language model training: if a model can dynamically optimise its tokenisation, how do subword boundaries shift across pretraining and finetuning? The authors extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support both pretraining and finetuning, and train models for three typologically diverse languages spanning the morphological spectrum: conjunctive isiXhosa, disjunctive Setswana, and English as a typological middle ground. Analysing subword dynamics through linguistic measures such as morphology, productivity, and fertility, they identify four stages of subword learning, find that the morphologically complex isiXhosa exhibits greater boundary instability, and show that finetuning drives segmentation toward finer granularity. The results suggest that learnable subwords offer a promising route to improved text generation and cross-lingual transfer for low-resource, morphologically complex languages.
📝 Abstract
Subword segmentation is typically applied in preprocessing and stays fixed during training. Alternatively, it can be learned during training to optimise the training objective. In this paper we study the learning dynamics of subword segmentation: if a language model can dynamically optimise tokenisation, how do its subwords evolve during pretraining and finetuning? To explore this, we extend the subword segmental language model (SSLM), a framework for learning subwords during training, to support pretraining and finetuning. We train models for three typologically diverse languages to study learning dynamics across the morphological spectrum: isiXhosa is conjunctive (long word forms composed of many morphemes), Setswana is disjunctive (morphemes written as separate words), and English represents a typological middle ground. We analyse subword dynamics from a linguistic perspective, tracking morphology, productivity, and fertility. We identify four stages of subword learning, with the morphologically complex isiXhosa exhibiting greater instability. During finetuning, subword boundaries shift to become finer-grained. Lastly, we show that learnable subwords offer a promising approach to improve text generation and cross-lingual transfer for low-resource, morphologically complex languages.
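The key mechanism behind learnable subword segmentation in segmental language models is marginalizing over all possible segmentations of a character sequence with dynamic programming, so that subword boundaries are optimised by the training objective rather than fixed in preprocessing. The sketch below is not the paper's implementation; it is a toy illustration of that marginalization, assuming a hypothetical lexicon `subword_logp` of per-subword log-probabilities (in the actual SSLM these scores come from a neural model conditioned on context).

```python
import math


def _logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def log_marginal(text, subword_logp, max_len=4):
    """Log-probability of `text`, summed over all subword segmentations.

    alpha[i] holds the log-probability of generating text[:i],
    marginalized over every way of splitting it into known subwords
    of length at most `max_len` (a forward-algorithm recursion).
    """
    alpha = [-math.inf] * (len(text) + 1)
    alpha[0] = 0.0  # empty prefix has probability 1
    for i in range(1, len(text) + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if piece in subword_logp:
                alpha[i] = _logaddexp(alpha[i], alpha[j] + subword_logp[piece])
    return alpha[len(text)]


# Toy lexicon (illustrative scores only): the word "undo" can be
# generated as "undo", "un"+"do", or via single characters.
subword_logp = {
    "un": -1.0, "do": -1.0, "undo": -1.5,
    "u": -3.0, "n": -3.0, "d": -3.0, "o": -3.0,
}
score = log_marginal("undo", subword_logp)
```

Because the recursion sums rather than maximizes, gradients flow to every segmentation in proportion to its probability; this is what lets boundaries drift during training, as the paper's learning-dynamics analysis tracks.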