🤖 AI Summary
Multilingual large language models (LLMs) face critical bottlenecks—low tokenization efficiency (high token-to-word ratio), suboptimal context utilization, and slow inference—when processing languages with high script diversity and complex orthographies, such as Indian languages. To address these challenges, we propose a multilingual-optimized tokenizer design framework comprising three core components: (i) a corpus balancing algorithm to enhance script and language representativeness in training data; (ii) a systematic analysis of pre-tokenization strategies, augmented with linguistically motivated rule-based optimizations; and (iii) joint tuning of vocabulary size and pre-tokenization logic. Evaluated on state-of-the-art Indian-language LMs, our approach reduces the average token-to-word ratio by over 40% and significantly accelerates inference. These results demonstrate that efficient, linguistically grounded tokenization is a pivotal lever for building scalable, high-performance multilingual LMs.
📝 Abstract
While model architecture and training objectives are well-studied, tokenization, particularly in multilingual contexts, remains a relatively neglected aspect of Large Language Model (LLM) development. Existing tokenizers often exhibit high token-to-word ratios, inefficient use of context length, and slower inference. We present a systematic study that links vocabulary size, pre-tokenization rules, and training-corpus composition to both token-to-word efficiency and model quality. To ground our analysis in a linguistically diverse context, we conduct extensive experiments on Indic scripts, which present unique challenges due to their high script diversity and orthographic complexity. Drawing on the insights from these analyses, we propose a novel algorithm for data composition that balances multilingual data for tokenizer training. Our observations on pre-tokenization strategies significantly improve model performance, and our data composition algorithm reduces the average token-to-word ratio by approximately 6% relative to the conventional data randomization approach. Our tokenizer achieves more than 40% improvement in average token-to-word ratio against state-of-the-art multilingual Indic models. These improvements yield measurable gains in both model performance and inference speed, highlighting tokenization, alongside architecture and training objectives, as a critical lever for building efficient, scalable multilingual LLMs.
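The token-to-word ratio (sometimes called fertility) used throughout the abstract can be illustrated with a minimal sketch. The `tokenize` function below is a hypothetical stand-in for a subword tokenizer's encode step, not the paper's actual tokenizer; it crudely fragments words into fixed-size chunks just to make the metric concrete.

```python
def tokenize(text: str) -> list[str]:
    # Hypothetical toy tokenizer: splits each whitespace-delimited word
    # into chunks of up to 3 characters, mimicking how a subword
    # tokenizer fragments words it has not memorized whole.
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return tokens


def token_to_word_ratio(text: str) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    words = text.split()
    if not words:
        return 0.0
    return len(tokenize(text)) / len(words)


# 3 words -> 11 toy tokens, so the ratio is 11/3 ~= 3.67.
print(token_to_word_ratio("tokenization efficiency matters"))
```

A lower ratio means fewer tokens per word, which translates directly into more text per fixed context window and fewer decoding steps at inference time; a 40% reduction in this ratio is therefore a 40% reduction in sequence length for the same input.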