Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

📅 2025-01-28
🤖 AI Summary
Conventional scaling paradigms for large language models (LLMs) focus almost exclusively on increasing parameter count, overlooking vocabulary size as an independent and critical scaling dimension. Method: We propose the Over-Tokenized Transformer, which decouples input and output vocabularies to support hundred-thousand-scale, multi-granularity subword tokenization, and introduces an adaptive output projection mechanism. Contribution/Results: Our analysis establishes a log-linear relationship between input vocabulary size and training loss. Empirically, under fixed compute budgets, models with expanded input vocabularies match the performance of baselines with twice the parameter count. Evaluations across multiple benchmarks confirm consistent, substantial gains from vocabulary scaling. This work identifies vocabulary size as a fundamental axis for LLM scaling, orthogonal to parameter count, and offers a computationally efficient pathway for designing high-performance language models.
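The core design in the summary, a large input embedding table decoupled from a small output projection, with each position embedded via multi-gram lookups, can be sketched roughly as follows. All names, sizes, and the hashing trick for mapping n-grams into the table are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

base_vocab = 1_000      # output vocabulary stays small (assumed size)
input_vocab = 100_000   # over-sized input embedding table (assumed size)
d_model = 32

# Decoupled tables: the input embedding and the output projection no
# longer share one vocabulary.
input_emb = rng.normal(0.0, 0.02, (input_vocab, d_model))
output_proj = rng.normal(0.0, 0.02, (d_model, base_vocab))

def ngram_id(tokens, i, n):
    """Hash the n-gram ending at position i into the large input table
    (a simplification; the real mapping may differ)."""
    return hash(tuple(tokens[max(0, i - n + 1): i + 1])) % input_vocab

def embed(tokens, max_n=3):
    # Each position sums the embeddings of its 1-, 2-, and 3-gram ids,
    # giving the model multi-granularity input features.
    return np.stack([
        sum(input_emb[ngram_id(tokens, i, n)] for n in range(1, max_n + 1))
        for i in range(len(tokens))
    ])

tokens = [17, 42, 42, 7]
h = embed(tokens)          # (4, d_model) multi-gram input representations
logits = h @ output_proj   # (4, base_vocab): predictions over the small vocab
```

Because the extra parameters live only in the embedding lookup, the per-step compute of the transformer body and the output softmax is essentially unchanged, which is why the paper can scale the input vocabulary "with no additional cost".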

📝 Abstract
Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance is not fully explored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach scales up input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Using a large input vocabulary, we achieve performance comparable to double-sized baselines with no additional cost. Our findings highlight the importance of tokenization in scaling laws and provide practical insight for tokenizer design, paving the way for more efficient and powerful LLMs.
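The log-linear relationship the abstract reports means training loss should fall linearly in the logarithm of the input vocabulary size. A minimal sketch of how such a relationship would be fitted, using made-up illustrative (vocabulary, loss) pairs rather than the paper's measurements:

```python
import numpy as np

# Hypothetical data consistent with the claimed trend: each ~4x increase
# in input vocabulary lowers training loss by a roughly constant amount.
vocab_sizes = np.array([3e3, 1.2e4, 4.8e4, 1.9e5, 7.7e5])
train_losses = np.array([2.95, 2.87, 2.79, 2.71, 2.63])  # illustrative only

# Fit loss = intercept + slope * log(V); a log-linear law means a good
# straight-line fit with a negative slope.
slope, intercept = np.polyfit(np.log(vocab_sizes), train_losses, 1)

def predicted_loss(vocab_size):
    return intercept + slope * np.log(vocab_size)
```

Under such a fit, the negative slope quantifies the per-log-unit benefit of vocabulary scaling, and extrapolating it is what lets one trade input vocabulary size against model size at fixed compute.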
Problem

Research questions and friction points this paper is trying to address.

Vocabulary Size
Tokenization Methods
Large Language Model Performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Over-Tokenized Transformer
Vocabulary Enhancement
Model Efficiency