Tokenization Disparities as Infrastructure Bias: How Subword Systems Create Inequities in LLM Access and Efficiency

📅 2025-10-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study exposes structural computational inequality induced by subword tokenization in multilingual large language models (LLMs): non-Latin and morphologically complex languages suffer substantial token inflation, increasing computational cost, reducing effective context utilization, and exacerbating access barriers for low-resource language users. We systematically quantify tokenization inefficiency across 200+ languages—first of its kind—introducing cross-lingually comparable metrics such as Relative Tokenization Cost (RTC). Using standardized, sentence-level preprocessing and unified tiktoken-based token counting, we measure tokens per sentence (TPS) and RTC at scale. Experiments reveal RTC ≈ 1 for Latin-script languages, whereas several non-Latin and highly inflected languages exhibit RTC values of 3–5, confirming pronounced linguistic bias in current LLM infrastructure. Our findings provide critical empirical evidence and an evaluation framework to guide adaptive tokenization design and foster inclusive, multilingual AI systems.

Technology Category

Application Category

📝 Abstract

Tokenization disparities pose a significant barrier to achieving equitable access to artificial intelligence across linguistically diverse populations. This study conducts a large-scale cross-linguistic evaluation of tokenization efficiency in over 200 languages to systematically quantify computational inequities in large language models (LLMs). Using a standardized experimental framework, we applied consistent preprocessing and normalization protocols, followed by uniform tokenization through the tiktoken library across all language samples. Comprehensive tokenization statistics were collected using established evaluation metrics, including Tokens Per Sentence (TPS) and Relative Tokenization Cost (RTC), benchmarked against English baselines. Our cross-linguistic analysis reveals substantial and systematic disparities: Latin-script languages consistently exhibit higher tokenization efficiency, while non-Latin and morphologically complex languages incur significantly greater token inflation, often 3-5 times higher RTC ratios. These inefficiencies translate into increased computational costs and reduced effective context utilization for underrepresented languages. Overall, the findings highlight structural inequities in current AI systems, where speakers of low-resource and non-Latin languages face disproportionate computational disadvantages. Future research should prioritize the development of linguistically informed tokenization strategies and adaptive vocabulary construction methods that incorporate typological diversity, ensuring more inclusive and computationally equitable multilingual AI systems.

Problem

Research questions and friction points this paper is trying to address.

Subword tokenization creates inequitable computational costs across languages

Non-Latin and complex languages face 3-5 times higher tokenization costs

Current AI systems disadvantage speakers of low-resource languages structurally

Innovation

Methods, ideas, or system contributions that make the work stand out.

Standardized framework evaluates tokenization across 200 languages

Measures token inflation via Tokens Per Sentence metrics

Proposes adaptive vocabulary for equitable multilingual AI

🔎 Similar Papers

No similar papers found.

Authors to Follow