Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights

📅 2025-06-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing tokenizers exhibit limited performance on morphologically rich, low-resource languages, particularly those of the Indian subcontinent. This paper systematically evaluates Byte-Pair Encoding (BPE) and Unigram Language Model tokenization across 17 Indian languages, analyzing the impact of vocabulary size, vocabulary construction strategy (joint vs. cluster-based training), and cross-lingual transfer in low-resource settings. We propose a novel language-relatedness-aware cross-lingual tokenization framework, demonstrating that pretraining tokenizers on typologically related high-resource languages significantly improves segmentation quality for their low-resource counterparts. Experiments reveal the heightened sensitivity of morphologically complex languages to the choice of subword segmentation algorithm and empirically validate the critical role of genealogical and typological language relationships in multilingual tokenization. Our findings provide both theoretical grounding and practical methodology for fair, efficient, and resource-adaptive multilingual tokenization.

📝 Abstract
Tokenization plays a pivotal role in multilingual NLP. However, existing tokenizers are often skewed towards high-resource languages, limiting their effectiveness for linguistically diverse and morphologically rich languages such as those of the Indian subcontinent. This paper presents a comprehensive intrinsic evaluation of tokenization strategies across 17 Indian languages. We quantify the trade-offs between bottom-up and top-down tokenizer algorithms (BPE and Unigram LM), analyze the effects of vocabulary size, and compare multilingual vocabulary construction strategies such as joint and cluster-based training. We also show that extremely low-resource languages can benefit from tokenizers trained on related high-resource languages. Our study provides practical insights for building fairer, more efficient, and linguistically informed tokenizers for multilingual NLP.
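The bottom-up/top-down contrast can be made concrete: BPE builds its vocabulary bottom-up by greedily merging frequent symbol pairs, whereas Unigram LM starts from a large candidate vocabulary and prunes it top-down. Below is a minimal, self-contained sketch of the BPE merge loop on a toy corpus; it is an illustration of the algorithm only, not the paper's implementation.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (minimal sketch).

    BPE is 'bottom-up': it starts from characters and repeatedly merges
    the most frequent adjacent symbol pair, growing the vocabulary.
    """
    # Represent each word as a tuple of symbols, starting from characters.
    vocab = Counter(tuple(word) for word in words)

    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge to every word in the corpus.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

corpus = ["low", "low", "lower", "lowest", "newest", "newest"]
merges = train_bpe(corpus, num_merges=4)
print(merges)  # first merges: ('l', 'o'), then ('lo', 'w'), ...
```

Unigram LM would instead score candidate subwords with a probabilistic model and prune the least useful ones, which is why the two algorithms can segment morphologically complex words quite differently.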
Problem

Research questions and friction points this paper is trying to address.

Characterizes tokenization challenges in Indian languages
Compares tokenizer algorithms and vocabulary-construction strategies
Improves tokenization for low-resource languages via related high-resource ones
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates BPE and Unigram LM tokenizers
Compares joint and cluster-based vocabulary training
Leverages high-resource languages for low-resource ones
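One common intrinsic measure of segmentation quality used in tokenizer evaluations is fertility, the average number of subword tokens per word: a vocabulary well matched to a language yields fertility close to 1, while a mismatched one fragments words into many pieces. The sketch below uses a hypothetical toy vocabulary and greedy longest-match segmentation; the paper's exact metrics and tokenizers may differ.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match segmentation against a fixed subword vocabulary."""
    tokens, i = [], 0
    while i < len(word):
        # Take the longest substring starting at i that is in the vocabulary.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Unknown character: fall back to a single-character token.
            tokens.append(word[i])
            i += 1
    return tokens

def fertility(words, vocab):
    """Average subword tokens per word; lower means a better vocabulary fit."""
    return sum(len(greedy_tokenize(w, vocab)) for w in words) / len(words)

# Hypothetical romanized toy words and vocabularies for illustration.
words = ["roti", "pani", "kitab"]
char_vocab = set("abiknoprt")                       # characters only
word_vocab = char_vocab | {"roti", "pani", "kitab"} # plus whole words

print(fertility(words, char_vocab))  # words fragment into characters
print(fertility(words, word_vocab))  # each word is a single token
```

This is the kind of measurement that lets one compare, say, a tokenizer trained jointly on many languages against one trained on a typologically related high-resource language.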