🤖 AI Summary
This study addresses the widespread assumption that tokens in large language models serve as stable units of measurement, despite significant variation in token length across tokenizers and text domains, a discrepancy that introduces bias into model evaluation and billing. The work provides the first systematic, large-scale empirical quantification of how mainstream tokenizers compress sequences under diverse text distributions, revealing high variability in token lengths. By challenging the common simplifying assumption that token length is approximately constant, the research clarifies the limitations of treating tokens as a universal metric. These findings provide a more accurate foundation for assessing model performance, estimating computational resources, and designing fairer usage-based billing mechanisms.
📄 Abstract
Frontier LLMs are increasingly utilised across academia, society and industry. A commonly used unit for comparing models, their inputs and outputs, and for estimating inference pricing is the token. In general, tokens are treated as a stable currency, assumed to be broadly consistent across tokenizers and contexts, enabling direct comparisons. However, tokenization varies significantly across models and domains of text, making naive interpretation of token counts problematic. We quantify this variation through a comprehensive empirical analysis of tokenization, exploring the compression of sequences to tokens across different distributions of textual data. Our analysis challenges commonly held heuristics about token lengths, finding them to be overly simplistic. We hope the insights of our study add clarity and intuition about tokenization in contemporary LLMs.
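To make the central point concrete, here is a minimal, hypothetical sketch (not the paper's methodology) showing how the same text yields very different token counts, and hence different characters-per-token ratios, under two toy tokenization schemes. Real tokenizers such as BPE variants fall between these extremes, and their compression likewise depends on the text domain.

```python
# Illustrative only: two toy tokenizers applied to the same text.
# The point is that "token count" depends entirely on the tokenizer,
# so tokens are not a stable unit of measurement across models.

def whitespace_tokenize(text):
    # Coarse scheme: one token per whitespace-separated word.
    return text.split()

def char_tokenize(text):
    # Fine-grained scheme: one token per character.
    return list(text)

def chars_per_token(text, tokenizer):
    # A simple compression measure: characters covered per token.
    tokens = tokenizer(text)
    return len(text) / len(tokens)

text = "Tokenization varies significantly across models and domains."

for name, tok in [("whitespace", whitespace_tokenize),
                  ("character", char_tokenize)]:
    n = len(tok(text))
    ratio = chars_per_token(text, tok)
    print(f"{name}: {n} tokens, {ratio:.2f} chars/token")
```

The same sentence produces 7 tokens under the word-level scheme but 60 under the character-level one, a roughly 8.5x difference in "cost" for identical content, which is the kind of discrepancy the study quantifies at scale for real tokenizers.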