🤖 AI Summary
This work formalizes tokenization, the fundamental preprocessing step in natural language processing, as a constrained combinatorial optimization problem and proves its NP-hardness. To address this computationally hard problem, we propose GreedTok, an efficient polynomial-time greedy algorithm. Furthermore, by relaxing the problem to the well-studied weighted maximum coverage problem, we obtain GreedWMC, a greedy algorithm with a $(1 - 1/e)$ approximation guarantee. Empirical evaluation on real-world corpora demonstrates that GreedTok outperforms the widely adopted Byte Pair Encoding (BPE) method while achieving objective values comparable to those of GreedWMC. Beyond the algorithmic contributions, this work establishes a formal connection between tokenization and combinatorial optimization, providing a principled framework that offers both provable guarantees and strong empirical performance, paving the way for more interpretable, controllable, and theoretically grounded subword segmentation.
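The BPE baseline mentioned above builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols. A toy sketch of that merge loop (my own illustrative helper, not the paper's or any library's implementation):

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    """Toy BPE vocabulary construction: corpus is a list of words;
    repeatedly merge the most frequent adjacent symbol pair into a
    new token and return the merged tokens in order of creation."""
    words = [list(w) for w in corpus]  # start from character symbols
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)
        # Replace every occurrence of the chosen pair with the merged token.
        for w in words:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i:i + 2] = [a + b]
                else:
                    i += 1
    return merges
```

For example, on the corpus `["low", "low", "lower"]` with two merges, the pair `("l", "o")` is most frequent first, then `("lo", "w")`, yielding the tokens `["lo", "low"]`. Real BPE implementations operate on pre-tokenized word counts and byte-level symbols, but the greedy merge structure is the same.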
📝 Abstract
Tokenization is the process of encoding strings into tokens from a fixed vocabulary of size $k$ and is widely utilized in Natural Language Processing applications. The leading tokenization algorithm today is Byte Pair Encoding (BPE), which formulates the tokenization problem as a compression problem and tackles it by performing sequences of merges. In this work, we formulate tokenization as an optimization objective, show that it is NP-hard via a simple reduction from vertex cover, and propose a polynomial-time greedy algorithm GreedTok. Our formulation naturally relaxes to the well-studied weighted maximum coverage problem, which has a simple $(1 - 1/e)$-approximation algorithm GreedWMC. Through empirical evaluations on real-world corpora, we show that GreedTok outperforms BPE while achieving an objective score comparable to that of GreedWMC (which could have achieved a higher score due to the relaxation).
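The relaxation invoked above is the classic weighted maximum coverage problem: given sets over a weighted universe and a budget $k$, pick $k$ sets maximizing the total weight of covered elements. The standard greedy algorithm, which repeatedly picks the set covering the most uncovered weight, achieves the $(1 - 1/e)$ guarantee. A minimal sketch of that generic greedy (the function name is mine; this is not the paper's GreedWMC code):

```python
def greedy_weighted_max_coverage(sets, weights, k):
    """Generic (1 - 1/e)-approximate greedy for weighted maximum coverage.

    sets:    list of element collections
    weights: dict mapping element -> nonnegative weight
    k:       number of sets to select
    Returns (indices of chosen sets, total covered weight).
    """
    covered = set()
    chosen = []
    for _ in range(k):
        best_i, best_gain = None, 0
        for i, s in enumerate(sets):
            if i in chosen:
                continue
            # Marginal gain: weight of elements not yet covered.
            gain = sum(weights[e] for e in s if e not in covered)
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:  # no remaining set adds any weight
            break
        chosen.append(best_i)
        covered.update(sets[best_i])
    return chosen, sum(weights[e] for e in covered)
```

For instance, with `sets = [{1, 2, 3}, {3, 4}, {4, 5}]`, `weights = {1: 1, 2: 1, 3: 1, 4: 10, 5: 1}`, and `k = 2`, the greedy first takes `{3, 4}` (gain 11), then `{1, 2, 3}` (gain 2), covering total weight 13. In the tokenization setting, the "sets" would correspond to candidate tokens and the weighted elements to the substring occurrences they cover, but the reduction details are in the paper itself.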