How does a Language-Specific Tokenizer affect LLMs?

📅 2025-02-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the impact of language-specific tokenizers on English-pretrained large language models (LLMs) in non-English settings, using Korean as a case study. To address Korean's morphological complexity, the authors design and integrate a subword-level extended tokenizer and systematically evaluate it on next-token prediction tasks. The empirical results show that, compared with the basic tokenizer, the language-specialized tokenizer reduces the model's confidence in incorrect predictions and lowers cross-entropy on complex tasks, yielding less nonsensical and more stable generations that can in turn improve performance on downstream tasks. The study highlights the role of tokenizer language adaptivity in shaping LLM behavior and offers a reproducible methodology for cross-lingual LLM adaptation.

📝 Abstract
The necessity of language-specific tokenizers intuitively appears crucial for effective natural language processing, yet empirical analyses on their significance and underlying reasons are lacking. This study explores how language-specific tokenizers influence the behavior of Large Language Models predominantly trained with English text data, through the case study of Korean. The research unfolds in two main stages: (1) the development of a Korean-specific extended tokenizer and (2) experiments to compare models with the basic tokenizer and the extended tokenizer through various Next Token Prediction tasks. Our in-depth analysis reveals that the extended tokenizer decreases confidence in incorrect predictions during generation and reduces cross-entropy in complex tasks, indicating a tendency to produce less nonsensical outputs. Consequently, the extended tokenizer provides stability during generation, potentially leading to higher performance in downstream tasks.
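The abstract's two evaluation quantities, confidence in incorrect predictions and cross-entropy of the gold next token, can be illustrated with a toy sketch. The logit values below are hypothetical, not taken from the paper; they only show how lower confidence on a wrong argmax and lower cross-entropy on the gold token are computed:

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_metrics(logits, gold_index):
    """Return (confidence in the argmax token, cross-entropy of the gold token)."""
    probs = softmax(logits)
    confidence = max(probs)
    cross_entropy = -math.log(probs[gold_index])
    return confidence, cross_entropy

# Hypothetical logits for the same Korean context under two tokenizers.
# Basic tokenizer: the model is confidently wrong (argmax != gold token 1).
# Extended tokenizer: probability mass shifts toward the gold token.
conf_b, ce_b = next_token_metrics([4.0, 1.0, 0.5], gold_index=1)
conf_e, ce_e = next_token_metrics([1.5, 2.5, 0.5], gold_index=1)
```

In this toy setup the extended-tokenizer model has both lower cross-entropy on the gold token and lower peak confidence when that confidence would have backed a wrong prediction, mirroring the two trends the abstract reports.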
Problem

Research questions and friction points this paper is trying to address.

Explores the impact of language-specific tokenizers on English-centric LLMs
Compares basic and extended tokenizers on Next Token Prediction tasks
Analyzes the tokenizer's effect on prediction confidence and cross-entropy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Korean-specific extended tokenizer development
Comparison through Next Token Prediction tasks
Reduced cross-entropy in complex tasks
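The subword-level extension can be pictured as appending language-specific subwords to a base vocabulary and segmenting text by longest match. This is a simplified stand-in, not the paper's actual pipeline (real subword extension would involve training merges and resizing the model's embedding matrix), and the vocabulary entries below are toy examples:

```python
def extend_vocab(base_vocab, new_subwords):
    """Append language-specific subwords to a base vocabulary,
    assigning fresh ids after the existing ones."""
    vocab = dict(base_vocab)
    for sw in new_subwords:
        if sw not in vocab:
            vocab[sw] = len(vocab)
    return vocab

def greedy_tokenize(text, vocab):
    """Longest-match-first segmentation, a rough stand-in for BPE/unigram inference."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append("<unk>")
            i += 1
    return tokens

# Toy base vocab: Korean covered only at the syllable level.
base = {"안": 0, "녕": 1, "하": 2, "세": 3, "요": 4}
extended = extend_vocab(base, ["안녕", "하세요"])

greedy_tokenize("안녕하세요", base)      # five syllable tokens
greedy_tokenize("안녕하세요", extended)  # two subword tokens
```

Coarser, morphologically sensible segments mean fewer prediction steps per Korean word, which is one intuition for why an extended tokenizer can stabilize generation.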
Jean Seo
Seoul National University
Jaeyoon Kim
KAIST
Sungjoo Byun
Seoul National University
Hyopil Shin
Seoul National University