IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs

📅 2025-11-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the inefficiency of existing subword tokenizers for Indian multilingual large language models, caused by high script diversity and morphological complexity, this paper proposes a dual-track tokenization framework that integrates Byte-Pair Encoding (BPE) with multi-word tokenization, augmented with language-specific pre-tokenization strategies. The framework jointly optimizes fine-grained linguistic modeling and semantic integrity within a unified architecture, improving tokenizer adaptability across diverse Indian languages. Experiments on 22 Indian languages, English, and code data demonstrate an average 39.5% improvement in fertility score over LLaMA4, a 44% gain in inference throughput, and competitive performance on multiple downstream benchmarks. This work is the first systematic integration of multi-granularity tokenization mechanisms, establishing an efficient and scalable tokenization paradigm for resource-heterogeneous multilingual settings.
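The fertility score referenced above is the average number of tokens a tokenizer emits per word; lower is better, since fewer tokens per word directly reduce training and inference cost. A minimal sketch of the metric, assuming whitespace word segmentation and a hypothetical toy `tokenize` function standing in for a real subword tokenizer:

```python
# Fertility score: average number of subword tokens emitted per word.
# Lower is better -- fewer tokens per word means cheaper training/inference.
# `tokenize` below is a hypothetical stand-in for any real subword tokenizer.

def tokenize(word: str) -> list[str]:
    # Toy tokenizer: split every 3 characters (real systems use BPE merges).
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def fertility(corpus: list[str]) -> float:
    """Total subword tokens divided by total whitespace-separated words."""
    words = [w for line in corpus for w in line.split()]
    total_tokens = sum(len(tokenize(w)) for w in words)
    return total_tokens / len(words)

print(fertility(["tokenizers determine model efficiency"]))  # -> 3.25
```

Under this definition, the paper's 39.5% improvement over LLaMA4 means its tokenizer needs substantially fewer tokens to encode the same Indic text.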

📝 Abstract
Tokenizers play a crucial role in determining the performance, training efficiency, and inference cost of Large Language Models (LLMs). Designing effective tokenizers for multilingual LLMs is particularly challenging due to diverse scripts and rich morphological variation. While subword methods such as Byte Pair Encoding (BPE) are widely adopted, their effectiveness in multilingual settings remains underexplored. We present IndicSuperTokenizer, a tokenizer for Indic multilingual LLMs that combines subword and multi-word tokenization with language-specific pre-tokenization, producing more linguistically aligned tokens and achieving a new state-of-the-art fertility score. Evaluated across English, 22 Indian languages, and code data, our tokenizer improves the average fertility score by 39.5% over LLaMA4 and by 18% over Sutra (the current best). This translates to a 44% improvement in inference throughput over LLaMA4 while maintaining comparable performance on English and Indic benchmarks. We also present detailed ablations across tokenizer training data size, vocabulary size, merging techniques, and pre-tokenization strategies, demonstrating the robustness of our design choices.
Problem

Research questions and friction points this paper is trying to address.

Optimizing tokenizer performance for multilingual Indic language models
Addressing script diversity and morphological variation in tokenization
Improving fertility scores and inference throughput over existing methods
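One concrete way script diversity inflates token counts: byte-level tokenizers without Indic-specific merges fall back to raw UTF-8 bytes, and Devanagari codepoints each occupy 3 bytes in UTF-8, versus 1 byte for ASCII. A small illustration (the example words are invented for demonstration):

```python
# Why script diversity hurts un-adapted tokenizers: with a byte-level
# fallback, token count tracks UTF-8 byte count, and Devanagari codepoints
# are 3 bytes each -- roughly tripling the cost of encoding Indic text.

latin = "namaste"      # 7 ASCII characters  -> 7 UTF-8 bytes
devanagari = "नमस्ते"    # 6 Devanagari codepoints -> 18 UTF-8 bytes

print(len(latin.encode("utf-8")))       # byte-fallback cost for Latin text
print(len(devanagari.encode("utf-8")))  # byte-fallback cost for Devanagari
```

Learning Indic-aware merges, as this paper does, collapses those byte runs into far fewer tokens.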
Innovation

Methods, ideas, or system contributions that make the work stand out.

Combines subword and multi-word tokenization methods
Uses language-specific pre-tokenization for linguistic alignment
Achieves state-of-the-art fertility score improvements
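The multi-word track can be sketched as greedy longest-match against a vocabulary that mixes learned multi-word entries with single words. The vocabulary below is invented for illustration; the paper learns such entries from corpus statistics rather than hand-listing them:

```python
# Minimal sketch of multi-word tokenization: frequent word sequences become
# single tokens via greedy longest-match. VOCAB is a hypothetical example;
# real systems derive multi-word entries from corpus frequency statistics.

VOCAB = {("new", "delhi"), ("machine", "learning")}
MAX_SPAN = 2  # longest multi-word entry in the vocabulary

def multiword_tokenize(text: str) -> list[str]:
    words, tokens, i = text.lower().split(), [], 0
    while i < len(words):
        # Try the longest span first, falling back to a single word.
        for span in range(min(MAX_SPAN, len(words) - i), 0, -1):
            candidate = tuple(words[i:i + span])
            if span == 1 or candidate in VOCAB:
                tokens.append(" ".join(candidate))
                i += span
                break
    return tokens

print(multiword_tokenize("Machine learning thrives in New Delhi"))
# -> ['machine learning', 'thrives', 'in', 'new delhi']
```

In the paper's framework, tokens surviving this pass would then feed the BPE subword track, so common phrases stay intact while rare words still decompose into subwords.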