How Long Is a Piece of String? A Brief Empirical Analysis of Tokenizers

πŸ“… 2026-01-16
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study addresses the widespread assumption in large language model practice that tokens serve as stable units of measurement, despite significant variation in token length across tokenizers and text domains, a discrepancy that introduces bias into model evaluation and billing. For the first time, this work systematically quantifies the sequence compression behavior of mainstream tokenizers under diverse text distributions through large-scale empirical analysis, revealing high variability in token length. By challenging the common simplifying assumption that token length is approximately constant, the research clarifies the limitations of treating tokens as a universal metric. These findings provide a more accurate empirical foundation for assessing model performance, estimating computational resources, and designing fairer usage-based billing mechanisms.

πŸ“ Abstract
Frontier LLMs are increasingly utilised across academia, society and industry. A commonly used unit for comparing models, their inputs and outputs, and estimating inference pricing is the token. In general, tokens are used as a stable currency, assumed to be broadly consistent across tokenizers and contexts, enabling direct comparisons. However, tokenization varies significantly across models and domains of text, making naive interpretation of token counts problematic. We quantify this variation by providing a comprehensive empirical analysis of tokenization, exploring the compression of sequences to tokens across different distributions of textual data. Our analysis challenges commonly held heuristics about token lengths, finding them to be overly simplistic. We hope the insights of our study add clarity and intuition toward tokenization in contemporary LLMs.
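The abstract's core claim, that the same text maps to different token counts under different tokenizers, can be sketched with a toy example. Both tokenizers below are hypothetical stand-ins (a whitespace splitter and fixed-width character chunking), not any real model's tokenizer, but they show why characters-per-token is not a constant:

```python
# Toy illustration: identical text, different token counts under
# different tokenization schemes. Neither function corresponds to a
# real model's tokenizer; both are illustrative assumptions.

def whitespace_tokenize(text):
    """Split on whitespace: roughly word-level tokens."""
    return text.split()

def chunk_tokenize(text, width=3):
    """Fixed-width character chunks: a crude stand-in for subword tokenization."""
    return [text[i:i + width] for i in range(0, len(text), width)]

sample = "Tokenization varies significantly across models and domains of text."

for name, tok in [("whitespace", whitespace_tokenize),
                  ("3-char chunks", chunk_tokenize)]:
    tokens = tok(sample)
    ratio = len(sample) / len(tokens)  # characters per token
    print(f"{name:14s} tokens={len(tokens):3d}  chars/token={ratio:.2f}")
```

Because the two schemes compress the same 68-character string into different numbers of tokens, any heuristic of the form "one token is about N characters" is tokenizer-dependent, which is exactly the variation the paper quantifies at scale.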
Problem

Research questions and friction points this paper is trying to address.

tokenization
large language models
token count
tokenizer variation
text compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

tokenization
large language models
empirical analysis
tokenizer variation
token compression
πŸ”Ž Similar Papers
No similar papers found.