Say Anything but This: When Tokenizer Betrays Reasoning in LLMs

📅 2026-01-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses reasoning inconsistency caused by tokenizer-induced one-to-many mappings in large language models, wherein semantically identical texts are encoded into divergent token ID representations. The authors propose a tokenization-consistency probing task that systematically isolates tokenizer-related errors by precisely substituting target words within fixed contexts. Through over 11,000 controlled substitutions and detailed analysis of tokenization–detokenization behavior, the study reveals for the first time that tokenizer deficiencies can induce "phantom edits" in model outputs, and it establishes a taxonomy of eight classes of tokenization artifacts. Crucially, the findings demonstrate that a significant proportion of reasoning failures in mainstream open-source large language models stem from tokenization-level representation issues rather than from limitations of model capacity or gaps in knowledge.
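The probing task described above can be illustrated with a minimal sketch: substitute one target word in a fixed context, then flag any deviation beyond that single intended edit as a phantom edit. The function names and the example sentences here are hypothetical, not taken from the paper's implementation.

```python
# Hypothetical sketch of the consistency check behind a
# tokenization-consistency probe: build the expected output by
# replacing one target word, then flag any other difference in the
# model's output as a phantom edit.
import difflib

def expected_output(context: str, target: str, replacement: str) -> str:
    """The ideal result: only the target word is replaced."""
    return context.replace(target, replacement)

def phantom_edits(expected: str, model_output: str):
    """Return the opcode spans where the model output deviates
    from the expected single-substitution result."""
    matcher = difflib.SequenceMatcher(a=expected, b=model_output)
    return [op for op in matcher.get_opcodes() if op[0] != "equal"]

expected = expected_output("the cat sat on the mat", "cat", "dog")

# A faithful model output yields no phantom edits:
assert phantom_edits(expected, "the dog sat on the mat") == []
# A whitespace-boundary shift (one artifact class) is flagged:
assert phantom_edits(expected, "the dogsat on the mat") != []
```

Because the surface task is trivial, any flagged span can be attributed to the tokenizer-detokenizer layer rather than to missing knowledge.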

📝 Abstract
Large language models (LLMs) reason over discrete token ID sequences, yet modern subword tokenizers routinely produce non-unique encodings: multiple token ID sequences can detokenize to identical surface strings. This representational mismatch creates an unmeasured fragility wherein reasoning processes can fail: LLMs may treat two internal representations as distinct "words" even when they are semantically identical at the text level. In this work, we show that tokenization can betray LLM reasoning through one-to-many token ID mappings. We introduce a tokenization-consistency probe that requires models to replace designated target words in context while leaving all other content unchanged. The task is intentionally simple at the surface level, enabling us to attribute failures to tokenizer-detokenizer artifacts rather than to knowledge gaps or parameter limitations. Through analysis of over 11,000 replacement trials across state-of-the-art open-source LLMs, we find that a non-trivial fraction of outputs exhibits phantom edits: cases where models operate under the illusion of correct reasoning, a phenomenon arising from tokenizer-induced representational defects. We further analyze these cases and provide a taxonomy of eight systematic tokenizer artifacts, including whitespace-boundary shifts and intra-word resegmentation. These findings indicate that part of the apparent reasoning deficiency originates in the tokenizer layer, motivating tokenizer-level remedies before incurring the cost of training ever-larger models on ever-larger corpora.
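The non-unique encoding at the heart of the abstract can be sketched with a toy subword vocabulary (hypothetical, not the paper's tokenizer): when a string is covered both by a merged token and by its constituent pieces, two distinct ID sequences detokenize to the same surface text.

```python
# Minimal sketch of a one-to-many token ID mapping using a toy
# subword vocabulary. Real BPE tokenizers exhibit the same property
# whenever a merge and its parts both exist in the vocabulary.
vocab = {0: "re", 1: "play", 2: "replay", 3: " the", 4: " game"}

def detokenize(ids):
    """Concatenate the token strings for a sequence of token IDs."""
    return "".join(vocab[i] for i in ids)

seq_a = [2, 3, 4]     # "replay" as one merged token
seq_b = [0, 1, 3, 4]  # "replay" as "re" + "play"

# Distinct ID sequences, identical surface string:
assert seq_a != seq_b
assert detokenize(seq_a) == detokenize(seq_b) == "replay the game"
```

A model that conditions on token IDs rather than surface text can therefore treat these two sequences as different inputs, which is the representational mismatch the probe is designed to expose.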
Problem

Research questions and friction points this paper is trying to address.

tokenizer inconsistency
tokenization artifacts
LLM reasoning
non-unique encoding
representational mismatch
Innovation

Methods, ideas, or system contributions that make the work stand out.

tokenizer-induced reasoning failure
tokenization-consistency probe
non-unique tokenization
phantom edits
subword segmentation artifacts