StochasTok: Improving Fine-Grained Subword Understanding in LLMs

📅 2025-06-02

📈 Citations: 1

✨ Influential: 0

🤖 AI Summary

Existing large language models (LLMs) perform poorly on fine-grained subword tasks such as character counting, spelling correction, acronym interpretation, and rhyme detection, primarily because standard tokenizers obscure intra-word structure. Alternatives such as character-level and dropout tokenization mitigate this issue, but they substantially increase computational cost and yield inconsistent improvements. This paper introduces StochasTok, a lightweight stochastic tokenization scheme: during training, it randomly splits tokens to expose their subword structure, requiring no architectural modifications. Because it can be applied at any stage of the training pipeline, it can also be used in post-training to instill subword awareness into existing pretrained models, avoiding re-pretraining from scratch. Experiments are reported to show that, without increasing inference latency, StochasTok improves the accuracy of counting the letter “r” in “berry” by 47% and substantially improves performance across diverse wordplay tasks.

📝 Abstract

Subword-level understanding is integral to numerous tasks, including understanding multi-digit numbers, spelling mistakes, abbreviations, rhyming, and wordplay. Despite this, current large language models (LLMs) still often struggle with seemingly simple subword-level tasks like "How many 'r's in 'strawberry'?". A key factor behind these failures is tokenization, which obscures the fine-grained structure of words. Current alternatives, such as character-level and dropout tokenization methods, significantly increase computational costs and provide inconsistent improvements. In this paper, we revisit tokenization and introduce StochasTok, a simple, efficient stochastic tokenization scheme that randomly splits tokens during training, allowing LLMs to 'see' their internal structure. Our experiments show that pretraining with StochasTok substantially improves LLMs' downstream performance across multiple subword-level language games, including character counting, substring identification, and math tasks. Furthermore, StochasTok's simplicity allows seamless integration at any stage of the training pipeline, and we demonstrate that post-training with StochasTok can instill improved subword understanding into existing pretrained models, thus avoiding costly pretraining from scratch. These dramatic improvements achieved with a minimal change suggest StochasTok holds exciting potential when applied to larger, more capable models. Code open-sourced at: https://github.com/anyasims/stochastok.
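
The idea of randomly splitting tokens during training can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it assumes a toy string-level vocabulary, and the function names `possible_splits` and `stochastic_split` and the parameter `p` (number of split attempts as a fraction of sequence length) are hypothetical choices made for illustration only.

```python
import random


def possible_splits(token, vocab):
    """Return every way to cut `token` into two pieces that are both in `vocab`."""
    return [
        (token[:i], token[i:])
        for i in range(1, len(token))
        if token[:i] in vocab and token[i:] in vocab
    ]


def stochastic_split(tokens, vocab, p=0.5, rng=None):
    """Randomly replace some tokens with an equivalent pair of smaller tokens.

    Assumption (illustrative): we make int(p * len(tokens)) split attempts,
    each on a uniformly chosen position. An attempt is skipped when the chosen
    token has no valid two-way split in the vocabulary.
    """
    if rng is None:
        rng = random.Random()
    tokens = list(tokens)  # work on a copy, leave the input untouched
    n_attempts = int(p * len(tokens))
    for _ in range(n_attempts):
        idx = rng.randrange(len(tokens))
        splits = possible_splits(tokens[idx], vocab)
        if splits:
            left, right = rng.choice(splits)
            tokens[idx:idx + 1] = [left, right]  # one token becomes two
    return tokens


# Toy vocabulary (hypothetical); a real tokenizer would have tens of thousands of entries.
toy_vocab = {"straw", "berry", "st", "raw", "ber", "ry"}
print(stochastic_split(["straw", "berry"], toy_vocab, p=1.0, rng=random.Random(0)))
```

Whatever splits are sampled, the concatenated text is unchanged; only the token boundaries vary between training passes, which is what would let a model observe that a token such as "berry" is composed of smaller pieces. At inference time the standard tokenizer can be used as-is, so a scheme of this kind adds no extra sequence length at that stage.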

Problem

Research questions and friction points this paper is trying to address.

Improving subword-level understanding in large language models

Addressing tokenization issues obscuring word structure

Avoiding the added computational cost of character-level and dropout tokenization alternatives

Innovation

Methods, ideas, or system contributions that make the work stand out.

Stochastic tokenization for subword understanding

Random token splitting during training

Seamless integration at any stage of the training pipeline, including post-training of existing pretrained models

🔎 Similar Papers

From Tokens to Words: On the Inner Lexicon of LLMs