🤖 AI Summary
Current large language model training relies on hard-coded tokenization, which hinders end-to-end learning. This work embeds the tokenization step directly into the model architecture and introduces a reinforcement-learning approach based on score function estimates to directly optimize discrete token boundaries by minimizing task-specific loss. To address the high variance inherent in this setting, the method incorporates a time-discounting mechanism, making end-to-end learned tokenization stable and trainable. Evaluated at the 100-million-parameter scale, the approach outperforms existing baselines, including straight-through estimators, on both qualitative and quantitative metrics, while offering tighter theoretical guarantees.
📝 Abstract
Tokenization is a hardcoded compression step that remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards increasingly end-to-end architectures. Prior work has shown promising results at scale in bringing this compression step inside the LLM's architecture using heuristics to draw token boundaries, and has also attempted to learn these token boundaries with straight-through estimates, which treat the problem of drawing discrete token boundaries as a continuous one. We show that these token boundaries can instead be learned using score function estimates, which carry tighter theoretical guarantees because they directly optimize the discrete problem of drawing token boundaries to minimize loss. We observe that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of this score function estimator sufficiently to make it practicable. We demonstrate that the resulting method outperforms previously proposed straight-through estimates, both qualitatively and quantitatively, at the $100$ million parameter scale.