🤖 AI Summary
Modern BPE tokenizers often segment calendar dates (e.g., “20250312”) into semantically meaningless subtokens (e.g., “202”, “503”, “12”), inflating token counts and disrupting temporal structure, which severely degrades model performance on time-reasoning tasks.
Method: We first propose the “date fragmentation ratio” to quantify this defect; construct DateAugBench, a comprehensive benchmark spanning historical, contemporary, and future date scenarios; and conduct layer-wise probing and causal attention analysis to investigate how LLMs internally process fragmented dates.
Contribution/Results: We discover that LLMs spontaneously develop a cross-layer “date abstraction” mechanism—reconstructing fragments along a hierarchical year→month→day pathway, yielding structured temporal reasoning distinct from human cognition. Empirically, fragmentation reduces accuracy on rare dates by up to 10 percentage points; larger models exhibit this mechanism earlier in depth; and it generalizes across date parsing, format-invariant interpretation, and date arithmetic tasks.
📝 Abstract
Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 → 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future regimes; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates, such as historical and futuristic ones. Further, we find that the larger the model, the earlier the emergent date abstraction that heals date fragments is accomplished. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, one that typically differs from human interpretation (year → month → day).
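To make the metric concrete, here is a minimal sketch of one plausible way to compute a date fragmentation ratio: the fraction of gold date components (year, month, day) that the tokenizer fails to emit as a single intact token. This is an illustrative formalization, not necessarily the paper's exact definition, and `toy_bpe_tokenize` is a hypothetical stand-in for a real BPE vocabulary.

```python
def toy_bpe_tokenize(s):
    """Hypothetical stand-in for a BPE tokenizer: greedily emits
    3-character chunks, mimicking how real vocabularies often chunk
    long digit runs (e.g. "20250312" -> ["202", "503", "12"])."""
    return [s[i:i + 3] for i in range(0, len(s), 3)]

def fragmentation_ratio(date_str, components):
    """Fraction of gold date components NOT preserved as one token.

    `components` are the gold spans in order, e.g. ("2025", "03", "12"),
    and must concatenate to `date_str`.
    """
    assert "".join(components) == date_str
    # Record the (start, end) character span of every emitted token.
    spans, pos = set(), 0
    for tok in toy_bpe_tokenize(date_str):
        spans.add((pos, pos + len(tok)))
        pos += len(tok)
    # A component counts as preserved iff some single token covers
    # exactly its character span; otherwise it is fragmented.
    broken, start = 0, 0
    for comp in components:
        end = start + len(comp)
        if (start, end) not in spans:
            broken += 1
        start = end
    return broken / len(components)

# "20250312" -> 202 | 503 | 12: year and month are broken, day survives.
print(fragmentation_ratio("20250312", ("2025", "03", "12")))
```

A ratio of 0 means every component survived tokenization intact; a ratio of 1 means all were fragmented, which is the regime the paper associates with the largest accuracy drops.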