Large Vocabulary Size Improves Large Language Models

📅 2024-06-24

🏛️ arXiv.org

📈 Citations: 3

✨ Influential: 0

career value

168K/year

🤖 AI Summary

This study systematically investigates the impact of subword vocabulary size on large language model (LLM) performance, particularly in cross-lingual continual pretraining. To address vocabulary mismatch between source and target languages, we propose a novel, tuning-free paradigm that enables dynamic vocabulary switching and reinitialization during continual training. Our large-scale empirical evaluation is the first to demonstrate that increasing subword vocabulary size consistently improves LLM performance across diverse multilingual benchmarks. Crucially, replacing the pretrained vocabulary with a language-specific one—rather than retaining the original—yields substantial gains in target-language continual training, improving BLEU and accuracy by 2.3–4.1 points on average. The approach achieves strong effectiveness without architectural or optimization complexity, offering a scalable, principled solution for vocabulary design and cross-lingual adaptation in LLMs.

Technology Category

Application Category

📝 Abstract

This paper empirically investigates the relationship between subword vocabulary size and the performance of large language models (LLMs) to provide insights on how to define the vocabulary size. Experimental results show that larger vocabulary sizes lead to better performance in LLMs. Moreover, we consider a continual training scenario where a pre-trained language model is trained on a different target language. We introduce a simple method to use a new vocabulary instead of the pre-defined one. We show that using the new vocabulary outperforms the model with the vocabulary used in pre-training.

Problem

Research questions and friction points this paper is trying to address.

Investigates subword vocabulary size impact on LLM performance

Explores vocabulary adaptation for continual training in new languages

Demonstrates new vocabulary outperforms pre-trained vocabulary

Innovation

Methods, ideas, or system contributions that make the work stand out.

Larger vocabulary improves model performance

New vocabulary method for continual training

Empirical study on subword vocabulary size

🔎 Similar Papers

No similar papers found.