Zero-Shot Tokenizer Transfer

📅 2024-05-13
🏛️ Neural Information Processing Systems
📈 Citations: 22
Influential: 2
🤖 AI Summary
This work addresses the strong coupling between large language models (LLMs) and fixed tokenizers by introducing zero-shot tokenizer transfer (ZeTT): dynamically substituting arbitrary tokenizers into a pretrained LLM without backbone fine-tuning while preserving performance. The core challenge lies in zero-shot initialization of high-quality token embeddings for unseen tokenizer vocabularies. To this end, the authors propose the first hypernetwork that takes tokenizer meta-representations as input and generates corresponding token embeddings—enabling cross-architectural generalization across encoder and decoder models. Experiments on XLM-R and Mistral-7B demonstrate that ZeTT achieves near-original-model performance with significantly shorter token sequences; full performance recovery requires only <1B tokens of fine-tuning, and the method remains compatible with downstream-finetuned models. This work establishes a general framework for decoupling tokenizers from LLMs, enhancing sequence efficiency and generalization—particularly in multilingual and programming-language settings.

📝 Abstract
Language models (LMs) are bound to their tokenizer, which maps raw text to a sequence of vocabulary items (tokens). This restricts their flexibility: for example, LMs trained primarily on English may still perform well in other natural and programming languages, but have vastly decreased efficiency due to their English-centric tokenizer. To mitigate this, we should be able to swap the original LM tokenizer with an arbitrary one, on the fly, without degrading performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for the tokens in the vocabulary of the new tokenizer. Since prior heuristics for initializing embeddings often perform at chance level in a ZeTT setting, we propose a new solution: we train a hypernetwork taking a tokenizer as input and predicting the corresponding embeddings. We empirically demonstrate that the hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and decoder LLMs (e.g., Mistral-7B). Our method comes close to the original models' performance in cross-lingual and coding tasks while markedly reducing the length of the tokenized sequence. We also find that the remaining gap can be quickly closed by continued training on less than 1B tokens. Finally, we show that a ZeTT hypernetwork trained for a base (L)LM can also be applied to fine-tuned variants without extra training. Overall, our results make substantial strides toward detaching LMs from their tokenizer.
Problem

Research questions and friction points this paper is trying to address.

Transferring language models to new tokenizers without performance loss
Finding embeddings for new tokenizer vocabularies in zero-shot setting
Enabling flexible tokenizer swaps to improve cross-lingual efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hypernetwork predicts embeddings for new tokenizers
Method reduces tokenized sequence length efficiently
Approach enables quick adaptation with minimal training
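The core idea above — a hypernetwork that maps a new tokenizer's vocabulary to token embeddings by decomposing each unseen token into pieces the model already knows — can be illustrated with a toy sketch. This is not the paper's actual architecture (the authors train a transformer-based hypernetwork on tokenizer meta-representations); the character-level decomposition, mean pooling, and two-layer MLP here are simplified stand-ins for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16  # embedding dim of the target LM (toy size)
H = 32  # hypernetwork hidden size

# Toy "base" vocabulary of pieces the hypernetwork already knows
# (here: printable ASCII characters, as a stand-in for base-tokenizer tokens).
base_vocab = {c: i for i, c in enumerate(map(chr, range(32, 127)))}
base_emb = rng.normal(size=(len(base_vocab), H))

# Hypothetical hypernetwork: pool base-piece embeddings, then a 2-layer MLP.
# In the paper this would be a learned transformer, trained so its outputs
# work as input/output embeddings for the frozen LM backbone.
W1 = rng.normal(size=(H, H)) / np.sqrt(H)
W2 = rng.normal(size=(H, D)) / np.sqrt(H)

def predict_embedding(token: str) -> np.ndarray:
    """Predict an embedding for an unseen token by decomposing it into
    known base pieces and pooling their embeddings before the MLP."""
    ids = [base_vocab[c] for c in token if c in base_vocab]
    pooled = base_emb[ids].mean(axis=0)
    return np.tanh(pooled @ W1) @ W2

# Zero-shot: build an embedding matrix for an arbitrary new vocabulary.
new_vocab = ["hello", "tokenizer", "##izer"]
E_new = np.stack([predict_embedding(t) for t in new_vocab])
print(E_new.shape)  # (3, 16)
```

The point of the sketch is the interface, not the architecture: the hypernetwork is a function from token strings to embedding vectors, so any new tokenizer's vocabulary can be converted into an embedding matrix without touching the backbone's weights.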
Benjamin Minixhofer
PhD Student, University of Cambridge & Ai2
Natural Language Processing · Representation Learning
E. Ponti
University of Edinburgh
Ivan Vulić
University of Edinburgh