🤖 AI Summary
This work identifies the root cause of systematic failures of large language models (LLMs) on character-level tasks—such as letter counting—as tokenization-induced low mutual information and delayed concept emergence. To rigorously characterize this deficit, we introduce a benchmark comprising 19 synthetic character-level tasks, the first to formalize deficient character understanding as a concept emergence problem. Grounded in percolation theory, we develop an interpretable analytical framework to quantify emergence dynamics. Building upon this insight, we propose a lightweight attention enhancement module and a token-level feature reweighting mechanism that strengthens character-level reasoning without compromising the inductive biases inherent to subword-based models. Experiments demonstrate a 37.2% average accuracy gain, emergence critical-point prediction error below 5%, and reveal a unified emergence law governing both character composition learning and commonsense reasoning.
📝 Abstract
Despite their remarkable progress across diverse domains, Large Language Models (LLMs) consistently fail at simple character-level tasks, such as counting letters in words, due to a fundamental limitation: tokenization. In this work, we frame this limitation as a problem of low mutual information and analyze it in terms of concept emergence. Using a suite of 19 synthetic tasks that isolate character-level reasoning in a controlled setting, we show that such capabilities emerge slowly, suddenly, and only late in training. We further show that percolation-based models of concept emergence explain these patterns, suggesting that learning character composition is not fundamentally different from learning commonsense knowledge. To address this bottleneck, we propose a lightweight architectural modification that significantly improves character-level reasoning while preserving the inductive advantages of subword models. Together, our results bridge low-level perceptual gaps in tokenized LMs and provide a principled framework for understanding and mitigating their structural blind spots. We make our code publicly available.