🤖 AI Summary
Modern BPE tokenizers often segment calendar dates (e.g., “20250312”) into semantically meaningless subtokens (e.g., “202”, “503”, “12”), inflating token counts and disrupting temporal structure, which severely degrades model performance on time-reasoning tasks.
Method: We first propose the “date fragmentation ratio” to quantify this defect; construct DateAugBench, a comprehensive benchmark spanning historical, contemporary, and future date scenarios; and conduct layer-wise probing and causal attention analysis to investigate how LLMs internally process fragmented dates.
Contribution/Results: We discover that LLMs spontaneously develop a cross-layer “date abstraction” mechanism—reconstructing fragments along a hierarchical year→month→day pathway, yielding structured temporal reasoning distinct from human cognition. Empirically, fragmentation reduces accuracy on rare dates by up to 10 percentage points; larger models exhibit this mechanism earlier in depth; and it generalizes across date parsing, format-invariant interpretation, and date arithmetic tasks.
📝 Abstract
Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 → 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future regimes; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates, such as historical and futuristic ones. Further, we find that the larger the model, the earlier the emergent date abstraction that heals date fragments is accomplished. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, one that typically differs from human interpretation (year → month → day).
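To make the metric concrete, here is a minimal sketch of one plausible way to compute a date fragmentation ratio: the fraction of gold date components (year, month, day) that the tokenizer fails to emit as a single intact token. This is an illustrative formalization, not necessarily the paper's exact definition, and `toy_bpe_tokenize` is a hypothetical stand-in for a real BPE vocabulary.

```python
def toy_bpe_tokenize(s):
    """Hypothetical stand-in for a BPE tokenizer: greedily emits
    3-character chunks, mimicking how real vocabularies often chunk
    long digit runs (e.g. "20250312" -> ["202", "503", "12"])."""
    return [s[i:i + 3] for i in range(0, len(s), 3)]

def fragmentation_ratio(date_str, components):
    """Fraction of gold date components NOT preserved as one token.

    `components` are the gold spans in order, e.g. ("2025", "03", "12"),
    and must concatenate to `date_str`.
    """
    assert "".join(components) == date_str
    # Record the (start, end) character span of every emitted token.
    spans, pos = set(), 0
    for tok in toy_bpe_tokenize(date_str):
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    # A component counts as preserved iff some single token covers
    # exactly its character span; otherwise it is fragmented.
    broken, start = 0, 0
    for comp in components:
        end = start + len(comp)
        if (start, end) not in spans:
            broken += 1
        start = end
    return broken / len(components)

# "20250312" -> 202 | 503 | 12: year and month are broken, day survives.
print(fragmentation_ratio("20250312", ("2025", "03", "12")))
```

A ratio of 0 means every component survived tokenization intact; a ratio of 1 means all were fragmented, which is the regime the paper associates with the largest accuracy drops.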