Vocab Diet: Reshaping the Vocabulary of LLMs with Vector Arithmetic

📅 2025-10-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Standard tokenization in large language models (LLMs) treats surface-form variants of a word (e.g., “walk”/“walked”) as distinct tokens, which fills the size-capped vocabulary at the expense of less frequent words and multilingual coverage. The paper proposes reshaping the vocabulary with vector arithmetic: because LLMs encode word-form variations as linear directions in embedding space, a variant can be represented as the base-form embedding plus a shared transformation vector (e.g., “walked” = “walk” + past tense), in both the input and output spaces. The approach leaves model weights unmodified and removes up to 10% of vocabulary entries with minimal impact on downstream performance, freeing space for new, more diverse tokens. It also extends vocabulary coverage to out-of-vocabulary words. The method is applied to multiple LLMs across five languages.

📝 Abstract

Large language models (LLMs) were shown to encode word form variations, such as "walk"->"walked", as linear directions in embedding space. However, standard tokenization algorithms treat these variations as distinct tokens -- filling the size-capped vocabulary with surface form variants (e.g., "walk", "walking", "Walk"), at the expense of less frequent words and multilingual coverage. We show that many of these variations can be captured by transformation vectors -- additive offsets that yield the appropriate word's representation when applied to the base form word embedding -- in both the input and output spaces. Building on this, we propose a compact reshaping of the vocabulary: rather than assigning unique tokens to each surface form, we compose them from shared base form and transformation vectors (e.g., "walked" = "walk" + past tense). We apply our approach to multiple LLMs and across five languages, removing up to 10% of vocabulary entries -- thereby freeing space to allocate new, more diverse tokens. Importantly, we do so while also expanding vocabulary coverage to out-of-vocabulary words, with minimal impact on downstream performance, and without modifying model weights. Our findings motivate a foundational rethinking of vocabulary design, moving from string enumeration to a compositional vocabulary that leverages the underlying structure of language.
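The composition "walked" = "walk" + past tense can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy embedding table is synthetic, the transformation vector is estimated as the mean offset over a few (base, variant) pairs, and the helper names `estimate_transformation` and `nearest` are hypothetical. The abstract does not specify how the paper obtains its transformation vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16

# Toy embedding table (synthetic, for illustration only): each inflected
# form is its base embedding plus a shared offset plus small noise,
# mimicking a linear regularity in embedding space.
past_offset = rng.normal(size=dim)
bases = ["walk", "jump", "play", "talk"]
emb = {w: rng.normal(size=dim) for w in bases}
for w in bases:
    emb[w + "ed"] = emb[w] + past_offset + 0.05 * rng.normal(size=dim)

def estimate_transformation(pairs, emb):
    """Mean offset from base form to variant over (base, variant) pairs.

    Assumption: averaging offsets is one simple estimator, not
    necessarily the one used in the paper.
    """
    return np.mean([emb[v] - emb[b] for b, v in pairs], axis=0)

def nearest(vec, emb):
    """Word whose embedding has the highest cosine similarity to vec."""
    words = list(emb)
    mat = np.stack([emb[w] for w in words])
    sims = mat @ vec / (np.linalg.norm(mat, axis=1) * np.linalg.norm(vec))
    return words[int(np.argmax(sims))]

# Estimate the past-tense vector without using "walk", then compose "walked".
train_pairs = [(b, b + "ed") for b in bases if b != "walk"]
past_vec = estimate_transformation(train_pairs, emb)
composed = emb["walk"] + past_vec
print(nearest(composed, emb))
```

In this toy setup the composed vector lands next to the stored "walked" embedding, which is the property that would let a vocabulary drop the "walked" entry and keep only "walk" plus the shared past-tense vector.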

Problem

Research questions and friction points this paper is trying to address.

Reduces vocabulary bloat from word variations in LLMs

Enables compositional vocabulary using vector arithmetic

Frees vocabulary space for less frequent words and multilingual coverage without modifying model weights

Innovation

Methods, ideas, or system contributions that make the work stand out.

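Captures word-form variations as additive transformation vectors in both the input and output embedding spaces

Composes surface forms from shared base forms and transformation vectors (e.g., "walked" = "walk" + past tense)

Removes up to 10% of vocabulary entries while extending coverage to out-of-vocabulary words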
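On the output side, a composed vocabulary can still be scored without storing a row per surface form, because the dot product distributes over the sum: h·(u_base + t) = h·u_base + h·t. The sketch below shows only that identity. The matrix `U`, the `transforms` dictionary, and the function `composed_logits` are hypothetical stand-ins with random values, not the paper's decoding procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 8

# Hypothetical compact vocabulary: only base forms keep their own output rows.
base_vocab = ["walk", "jump", "play"]
U = rng.normal(size=(len(base_vocab), dim))   # output (unembedding) rows
transforms = {                                # output-space transformation vectors
    "past": rng.normal(size=dim),
    "gerund": rng.normal(size=dim),
}

def composed_logits(h, U, transforms):
    """Logits for every (base, transformation) pair, plus the bare base form.

    Uses h.(u_base + t) = h.u_base + h.t, so no composed row is materialized.
    Column 0 ("identity") is the unmodified base form.
    """
    base_logits = U @ h                                    # shape (n_base,)
    names = ["identity"] + list(transforms)
    t_logits = np.array([0.0] + [transforms[n] @ h for n in names[1:]])
    return base_logits[:, None] + t_logits[None, :], names

h = rng.normal(size=dim)                      # stand-in for a final hidden state
logits, names = composed_logits(h, U, transforms)
i, j = np.unravel_index(np.argmax(logits), logits.shape)
print(base_vocab[i], names[j])
```

The cost is one dot product per base form and one per transformation, instead of one per surface form, which is where the vocabulary savings would come from under these assumptions.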
