FLEXITOKENS: Flexible Tokenization for Evolving Language Models

📅 2025-07-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Predefined subword tokenizers in language models cause excessive token fragmentation in out-of-distribution scenarios—such as emerging languages, scripts, or domains—thereby limiting generalization. To address this, we propose FLEXITOKENS, a byte-level learnable tokenization method that eliminates auxiliary losses enforcing fixed compression ratios and instead jointly optimizes language modeling and variable-length boundary prediction via end-to-end training. Its core innovation is a lightweight, learnable segmentation module that enables dynamic, adaptive text segmentation. Evaluated across multilingual, morphologically complex, and cross-domain benchmarks, FLEXITOKENS significantly reduces token fragmentation and improves downstream task performance by up to 10%. This work establishes a new paradigm for enhancing tokenizer plasticity in large language models, moving beyond static, precomputed vocabularies toward fully differentiable, context-aware tokenization.

📝 Abstract
Language models (LMs) are difficult to adapt to new data distributions through simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing over-fragmentation of out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries within the input byte sequence, encoding it into variable-length segments. Existing tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% improvements on downstream task performance compared to subword and other gradient-based tokenizers. Code and data for our experiments will be released at https://github.com/owos/flexitokens
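The abstract describes a boundary predictor that scores byte positions and pools the sequence into variable-length segments. A minimal sketch of the segmentation step is below; the function name, thresholding scheme, and inclusive-boundary convention are illustrative assumptions, not the paper's actual implementation (in practice the boundary probabilities would come from a learned submodule and be trained end-to-end).

```python
def segment_bytes(data: bytes, boundary_probs: list[float], threshold: float = 0.5) -> list[bytes]:
    """Split a byte sequence into variable-length segments wherever a
    (hypothetical) boundary predictor's probability exceeds a threshold.

    Each position whose probability >= threshold closes the current segment,
    so segment lengths adapt to the input rather than following a fixed
    compression rate.
    """
    segments, start = [], 0
    for i, p in enumerate(boundary_probs):
        if p >= threshold:
            segments.append(data[start : i + 1])  # close segment at boundary
            start = i + 1
    if start < len(data):
        segments.append(data[start:])  # trailing bytes form the last segment
    return segments


# Example: a confident boundary after position 4 splits the sequence in two.
probs = [0.1, 0.0, 0.2, 0.1, 0.9, 0.0, 0.1, 0.0, 0.2, 0.1]
print(segment_bytes(b"hellothere", probs))  # [b'hello', b'there']
```

At inference time a hard threshold like this suffices; during training, a differentiable relaxation of the boundary decision would be needed so gradients can flow into the predictor.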
Problem

Research questions and friction points this paper is trying to address.

Adapting language models to new data distributions
Overcoming rigidity in subword tokenizers during adaptation
Reducing token over-fragmentation in diverse domains
Innovation

Methods, ideas, or system contributions that make the work stand out.

Byte-level LMs with learnable tokenizers
Simplified training objective for flexibility
Reduces token over-fragmentation effectively
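The rigidity FLEXITOKENS removes comes from auxiliary losses that pin the boundary predictor to a fixed compression rate. One common form of such a penalty (illustrative only; not the exact loss from this or any prior paper) pushes the expected fraction of boundary positions toward a preset target:

```python
def compression_penalty(boundary_probs: list[float], target_rate: float) -> float:
    """Auxiliary loss of the kind FLEXITOKENS eliminates: penalizes the
    deviation of the expected boundary rate from a fixed target, forcing
    the same compression ratio on every input distribution.

    Illustrative sketch; the squared-error form and target_rate parameter
    are assumptions for exposition.
    """
    expected_rate = sum(boundary_probs) / len(boundary_probs)
    return (expected_rate - target_rate) ** 2


# A predictor placing boundaries at half the positions exactly matches a
# 0.5 target, so the penalty vanishes; deviating inputs get penalized.
print(compression_penalty([1.0, 0.0, 1.0, 0.0], 0.5))  # 0.0
print(compression_penalty([1.0, 1.0, 1.0, 1.0], 0.5))  # 0.25
```

Dropping a term like this lets the boundary rate vary with the input, which is why the model can segment unseen scripts or domains more coarsely instead of over-fragmenting them.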
Abraham Toluase Owodunni
The Ohio State University
Orevaoghene Ahia
University of Washington
Natural Language Processing · Computational Linguistics
Sachin Kumar
The Ohio State University