🤖 AI Summary
This work investigates the computational complexity of tokenization over fixed-size alphabets—particularly binary and unary alphabets. Using constructive reductions, we establish that bottom-up tokenization and direct tokenization are both NP-complete over binary alphabets and admit no polynomial-time approximation scheme unless P = NP; direct tokenization remains NP-complete even over unary alphabets. These results demonstrate that the intractability of tokenization does not stem from large alphabet size or structural complexity, but is a fundamental hardness barrier. The paper thus establishes strong complexity-theoretic lower bounds for tokenization, explaining the empirical reliance of practical NLP systems on heuristic approaches. Our findings provide a foundational complexity-theoretic justification for modeling choices in natural language processing, clarifying why exact, efficient algorithms for general tokenization are unlikely to exist.
📝 Abstract
Recent works have shown that tokenisation is NP-complete. However, these works assume tokenisation is applied to inputs with unboundedly large alphabets -- an unrealistic assumption, given that in practice tokenisers operate over fixed-size alphabets, such as bytes or Unicode characters. We close this gap by analysing tokenisation over bounded $n$-ary alphabets, considering two natural variants: bottom-up tokenisation and direct tokenisation, where we must, respectively, select a sequence of merge operations or a vocabulary whose application optimally compresses a dataset. First, we note that proving hardness results for an $n$-ary alphabet proves the same results for alphabets of any larger size. We then prove that even with binary alphabets, both variants are not only NP-complete, but admit no polynomial-time approximation scheme (unless P=NP). We further show that direct tokenisation remains NP-complete even when applied to unary alphabets. While unary alphabets may not be practically useful, this result establishes that the computational intractability of tokenisation is not an artifact of large alphabets or complex constructions, but a fundamental barrier. Overall, our results explain why practical algorithms such as BPE and UnigramLM are heuristic, and point toward approximation algorithms as an important path forward for tokenisation research.
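To make the bottom-up objective concrete, here is a minimal illustrative sketch (not taken from the paper): a merge operation replaces every non-overlapping occurrence of a symbol pair with a fresh symbol, and the goal is to pick the merge sequence that minimises the final token count. The paper shows that choosing the *optimal* sequence is NP-complete even over a binary alphabet; the `greedy_bpe` heuristic below, which simply picks the most frequent pair at each step (in the spirit of BPE), is the kind of approximation practical systems fall back on. All function names here are hypothetical.

```python
from collections import Counter

def apply_merge(seq, pair, new_sym):
    """Replace every non-overlapping, left-to-right occurrence of `pair`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_sym)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def greedy_bpe(seq, num_merges):
    """BPE-style heuristic: repeatedly merge the currently most frequent pair.

    This is NOT guaranteed optimal -- finding the merge sequence that
    minimises the final length is exactly the NP-complete problem.
    """
    merges = []
    for step in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, _ = pairs.most_common(1)[0]
        new_sym = f"m{step}"  # fresh symbol standing for the merged pair
        merges.append((pair, new_sym))
        seq = apply_merge(seq, pair, new_sym)
    return seq, merges

# A binary-alphabet dataset, matching the setting of the hardness results.
data = list("0101010111")
compressed, merges = greedy_bpe(data, 2)
print(len(data), "->", len(compressed))  # 10 -> 4
```

The two greedy merges ("01" → m0, then "m0 m0" → m1) compress ten symbols into four tokens; the hardness results say that certifying such a sequence is optimal, or even approximating the optimum to within every constant factor, cannot be done in polynomial time unless P = NP.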