How does a Language-Specific Tokenizer affect LLMs?

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the impact of language-specific tokenizers on English-pretrained large language models (LLMs) in non-English settings, using Korean as a case study. To address Korean's morphological complexity, the authors design and integrate a subword-level extended tokenizer and systematically evaluate it on next-token prediction tasks. The empirical results show that, compared with the basic tokenizer, the language-specialized tokenizer reduces the model's confidence in incorrect predictions and lowers cross-entropy on complex tasks, yielding less nonsensical and more stable generations that can in turn improve performance on downstream tasks. The study highlights the role of tokenizer language adaptivity in shaping LLM behavior and offers a reproducible methodology for cross-lingual LLM adaptation.

📝 Abstract
The necessity of language-specific tokenizers intuitively appears crucial for effective natural language processing, yet empirical analyses on their significance and underlying reasons are lacking. This study explores how language-specific tokenizers influence the behavior of Large Language Models predominantly trained with English text data, through the case study of Korean. The research unfolds in two main stages: (1) the development of a Korean-specific extended tokenizer and (2) experiments to compare models with the basic tokenizer and the extended tokenizer through various Next Token Prediction tasks. Our in-depth analysis reveals that the extended tokenizer decreases confidence in incorrect predictions during generation and reduces cross-entropy in complex tasks, indicating a tendency to produce less nonsensical outputs. Consequently, the extended tokenizer provides stability during generation, potentially leading to higher performance in downstream tasks.
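The abstract's two evaluation quantities, confidence in incorrect predictions and cross-entropy of the gold next token, can be illustrated with a toy sketch. The logit values below are hypothetical, not taken from the paper; they only show how lower confidence on a wrong argmax and lower cross-entropy on the gold token are computed:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_metrics(logits, gold_index):
    """Return (confidence in the argmax token, cross-entropy of the gold token)."""
    probs = softmax(logits)
    confidence = max(probs)
    cross_entropy = -math.log(probs[gold_index])
    return confidence, cross_entropy

# Hypothetical logits for the same Korean context under two tokenizers.
# Basic tokenizer: the model is confidently wrong (argmax != gold token 1).
# Extended tokenizer: probability mass shifts toward the gold token.
conf_b, ce_b = next_token_metrics([4.0, 1.0, 0.5], gold_index=1)
conf_e, ce_e = next_token_metrics([1.5, 2.5, 0.5], gold_index=1)
```

In this toy setup the extended-tokenizer model has both lower cross-entropy on the gold token and lower peak confidence when that confidence would have backed a wrong prediction, mirroring the two trends the abstract reports.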
Problem

Research questions and friction points this paper is trying to address.

Explores the impact of language-specific tokenizers on English-centric LLMs
Compares basic and extended tokenizers on Next Token Prediction tasks
Analyzes the tokenizer's effect on prediction confidence and cross-entropy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Korean-specific extended tokenizer development
Comparison through Next Token Prediction tasks
Reduced cross-entropy in complex tasks
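The subword-level extension can be pictured as appending language-specific subwords to a base vocabulary and segmenting text by longest match. This is a simplified stand-in, not the paper's actual pipeline (real subword extension would involve training merges and resizing the model's embedding matrix), and the vocabulary entries below are toy examples:

```python
def extend_vocab(base_vocab, new_subwords):
    """Append language-specific subwords to a base vocabulary,
    assigning fresh ids after the existing ones."""
    vocab = dict(base_vocab)
    for sw in new_subwords:
        if sw not in vocab:
            vocab[sw] = len(vocab)
    return vocab

def greedy_tokenize(text, vocab):
    """Longest-match-first segmentation, a rough stand-in for BPE/unigram inference."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append("<unk>")
            i += 1
    return tokens

# Toy base vocab: Korean covered only at the syllable level.
base = {"안": 0, "녕": 1, "하": 2, "세": 3, "요": 4}
extended = extend_vocab(base, ["안녕", "하세요"])

greedy_tokenize("안녕하세요", base)      # five syllable tokens
greedy_tokenize("안녕하세요", extended)  # two subword tokens
```

Coarser, morphologically sensible segments mean fewer prediction steps per Korean word, which is one intuition for why an extended tokenizer can stabilize generation.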
Jean Seo
Seoul National University
Jaeyoon Kim
KAIST
Sungjoo Byun
Seoul National University
Hyopil Shin
Seoul National University