🤖 AI Summary
This work addresses a limitation of subword tokenization: it ignores how words sound, which impairs text-only language models' ability to represent phonological knowledge such as rhyme and syllable structure. To quantify how far a tokenizer departs from phonological structure, the authors propose the syllabification-tokenization alignment distance (STAD), a diagnostic metric that measures the misalignment between a model's tokenization and the natural syllable boundaries of words. They further introduce a lightweight fine-tuning method based on International Phonetic Alphabet (IPA) representations, designed to add phonological awareness while largely preserving general capabilities. Experiments show consistent improvements across three phonology-related tasks, with drops of only 1.1% and 0.9% on GSM8K and MMLU, respectively, so math and general reasoning ability is largely preserved.
📝 Abstract
Tokenization is the first step in every language model (LM), yet it never takes the sounds of words into account. We investigate how tokenization influences text-only LMs' ability to represent phonological knowledge. Through a series of probing experiments, we show that subword-based tokenization systematically weakens the encoding of both local (e.g., rhyme) and global (e.g., syllabification) phonological features. To quantify this effect, we introduce the syllabification-tokenization alignment distance (STAD), a metric that measures the misalignment between a model's tokenization and the natural syllable boundaries of words. We find that higher misalignment correlates with poorer phonological representations, which makes STAD a simple diagnostic for phonology-aware tokenization. To address these limitations, we propose a lightweight IPA-based fine-tuning method that infuses phonological awareness into LMs, leading to consistent improvements across three phonology-related tasks while largely preserving math and general reasoning ability, with 1.1% and 0.9% drops on GSM8K and MMLU, respectively.
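The abstract does not give a formula for STAD, so the sketch below is not the authors' definition. It only illustrates the general idea of measuring misalignment between token boundaries and syllable boundaries, under the assumption that both segmentations cover the same character string. The function names `boundary_positions` and `boundary_misalignment` and the normalization by the union of boundaries are hypothetical choices made for illustration.

```python
def boundary_positions(pieces):
    """Return the internal character offsets where one piece ends and the next begins."""
    positions = set()
    offset = 0
    for piece in pieces[:-1]:  # the end of the last piece is the word end, not an internal boundary
        offset += len(piece)
        positions.add(offset)
    return positions


def boundary_misalignment(syllables, tokens):
    """Hypothetical misalignment score in [0, 1]; NOT the paper's STAD formula.

    Counts boundaries that appear in only one of the two segmentations and
    divides by the number of distinct boundaries in either segmentation.
    0.0 means the segmentations agree exactly; 1.0 means they share no boundary.
    """
    if "".join(syllables) != "".join(tokens):
        raise ValueError("both segmentations must spell the same word")
    syllable_bounds = boundary_positions(syllables)
    token_bounds = boundary_positions(tokens)
    all_bounds = syllable_bounds | token_bounds
    if not all_bounds:  # both segmentations are a single piece
        return 0.0
    return len(syllable_bounds ^ token_bounds) / len(all_bounds)


# Illustrative inputs only: the syllabification and the token split are made up
# for this example and do not come from the paper or from any real tokenizer.
score = boundary_misalignment(["to", "ken", "i", "za", "tion"], ["token", "ization"])
print(score)
```

In this toy case the token split shares one boundary with the syllabification and misses three, so the score is high; a tokenizer that split exactly at syllable boundaries would score 0.0. Any real measurement would need an actual syllabifier and the model's actual tokenizer output.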