🤖 AI Summary
This work addresses a limitation of existing subword tokenization algorithms such as BPE and Unigram: they are greedy, making locally optimal decisions without considering the resulting vocabulary as a whole. The paper formulates tokenizer construction as a linear program and solves it with convex optimization tools, yielding a new algorithm called ConvexTok. ConvexTok also provides a lower bound on the objective, which lets users certify how far a tokenizer is from optimal; empirically, it comes within 1% of optimal at common vocabulary sizes. It consistently improves intrinsic tokenization metrics and language model bits-per-byte (BpB), and it improves downstream task performance as well, though less consistently.
📝 Abstract
Tokenisation is an integral part of the current NLP pipeline. Current tokenisation algorithms such as BPE and Unigram are greedy algorithms: they make locally optimal decisions without considering the resulting vocabulary as a whole. We instead formulate tokeniser construction as a linear program and solve it using convex optimisation tools, yielding a new algorithm we call ConvexTok. We find ConvexTok consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models; it also improves downstream task performance, but less consistently. Furthermore, ConvexTok allows the user to certify how far their tokeniser is from optimal, with respect to a certain objective, via a lower bound, and we empirically find it to be within 1% of optimal at common vocabulary sizes.