🤖 AI Summary
This study addresses the widespread assumption that tokens in large language models serve as stable units of measurement, despite significant variation in token length across tokenizers and text domains, a discrepancy that introduces bias into model evaluation and billing. The work provides the first systematic, large-scale empirical quantification of how mainstream tokenizers compress sequences under diverse text distributions, revealing high variability in token lengths. By challenging the common simplifying assumption that token length is approximately constant, the research clarifies the limitations of treating tokens as a universal metric. These findings provide a more accurate foundation for assessing model performance, estimating computational resources, and designing fairer usage-based billing mechanisms.
📄 Abstract
Frontier LLMs are increasingly utilised across academia, society and industry. A commonly used unit for comparing models, their inputs and outputs, and for estimating inference pricing is the token. In general, tokens are treated as a stable currency, assumed to be broadly consistent across tokenizers and contexts, enabling direct comparisons. However, tokenization varies significantly across models and domains of text, making naive interpretation of token counts problematic. We quantify this variation through a comprehensive empirical analysis of tokenization, exploring the compression of sequences to tokens across different distributions of textual data. Our analysis challenges commonly held heuristics about token lengths, finding them to be overly simplistic. We hope the insights of our study add clarity and intuition about tokenization in contemporary LLMs.
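To make the central point concrete, here is a minimal, hypothetical sketch (not the paper's methodology) showing how the same text yields very different token counts, and hence different characters-per-token ratios, under two toy tokenization schemes. Real tokenizers such as BPE variants fall between these extremes, and their compression likewise depends on the text domain.

```python
# Illustrative only: two toy tokenizers applied to the same text.
# The point is that "token count" depends entirely on the tokenizer,
# so tokens are not a stable unit of measurement across models.

def whitespace_tokenize(text):
    # Coarse scheme: one token per whitespace-separated word.
    return text.split()

def char_tokenize(text):
    # Fine-grained scheme: one token per character.
    return list(text)

def chars_per_token(text, tokenizer):
    # A simple compression measure: characters covered per token.
    tokens = tokenizer(text)
    return len(text) / len(tokens)

text = "Tokenization varies significantly across models and domains."

for name, tok in [("whitespace", whitespace_tokenize),
                  ("character", char_tokenize)]:
    n = len(tok(text))
    ratio = chars_per_token(text, tok)
    print(f"{name}: {n} tokens, {ratio:.2f} chars/token")
```

The same sentence produces 7 tokens under the word-level scheme but 60 under the character-level one, a roughly 8.5x difference in "cost" for identical content, which is the kind of discrepancy the study quantifies at scale for real tokenizers.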