TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

📅 2025-10-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) employ statistical subword tokenizers (e.g., BPE), so semantically equivalent code can be tokenized inconsistently due to superficial variations such as whitespace or identifier naming, undermining model reliability in code understanding and generation. Method: This work systematically identifies the structural misalignment between subword tokenization and program syntax, proposing TokDrift, a framework that applies semantics-preserving rewrite rules to generate functionally equivalent code variants with heterogeneous tokenizations, and pinpoints the origin of the resulting bias through layer-wise embedding analysis. Contribution/Results: Across nine code LLMs, including models with over 30B parameters, experiments demonstrate that minor formatting changes significantly perturb model outputs, with the root cause traced to early embedding layers that neglect syntactic boundaries. This study reveals tokenizer-induced bias as a critical latent bottleneck for code-LLM reliability and advocates a shift toward grammar-aware tokenization design.

📝 Abstract
Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
Problem

Research questions and friction points this paper is trying to address.

LLMs use statistical tokenization that ignores programming language grammar
Semantically identical code gets different tokens due to formatting variations
Token misalignment causes unreliable model behavior in code understanding
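The drift described above can be illustrated with a toy greedy longest-match subword tokenizer (a stand-in for BPE; the vocabulary below is invented for illustration, not TokDrift's): inserting a single space before a parenthesis changes the resulting token sequence even though the code is semantically identical.

```python
# Minimal greedy longest-match subword tokenizer (a toy stand-in for BPE).
# The vocabulary is hypothetical; real BPE merges are learned from data.
VOCAB = {"def ", "def", "foo", " foo", "(", " (", "x", ")", ":", " "}

def tokenize(text, vocab=VOCAB):
    """Greedily match the longest vocabulary entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try longest candidate first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # fall back to a single character
            i += 1
    return tokens

print(tokenize("def foo(x):"))   # ['def ', 'foo', '(', 'x', ')', ':']
print(tokenize("def foo (x):"))  # ['def ', 'foo', ' (', 'x', ')', ':']
```

The two snippets differ only in whitespace, yet the tokenizer emits `'('` in one case and `' ('` in the other: two distinct token IDs, and thus two distinct inputs from the model's perspective.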
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces TokDrift framework for measuring tokenization impact
Applies semantic-preserving rewrite rules to create code variants
Highlights need for grammar-aware tokenization in code LLMs
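A TokDrift-style rewrite rule can be sketched as follows. The rule below (normalizing operator spacing) and the equivalence check are illustrative assumptions, not the paper's actual rule set: the point is that the rewrite changes only the surface form, which a functional check can confirm.

```python
# Sketch of a semantics-preserving rewrite rule in the spirit of TokDrift.
# The specific rule (operator spacing) is a hypothetical example.
import re

def add_operator_spaces(code: str) -> str:
    """Rewrite 'a+b' as 'a + b' -- purely cosmetic, semantics-preserving."""
    return re.sub(r"(\w)([+\-*/])(\w)", r"\1 \2 \3", code)

original = "def inc(x):\n    return x+1\n"
variant = add_operator_spaces(original)
assert variant != original  # surface form (and hence tokenization) differs

# Confirm the variant is functionally equivalent on sample inputs.
ns1, ns2 = {}, {}
exec(original, ns1)
exec(variant, ns2)
assert all(ns1["inc"](v) == ns2["inc"](v) for v in range(10))
```

Pairs produced this way differ only in tokenization, so any change in model behavior between them can be attributed to the tokenizer rather than to the program's semantics.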