🤖 AI Summary
This study investigates the root causes underlying the limited temporal reasoning capabilities of large language models—specifically, whether tokenization strategies or internal representation mechanisms are primarily responsible. To this end, we introduce MultiTempBench, a multilingual temporal reasoning benchmark encompassing three task types, five languages, and multiple calendar systems, yielding 15,000 controlled samples to systematically evaluate 20 models. We propose a novel metric, the multilingual Date Fragmentation Ratio (mDFR), and integrate geometric probing analyses to demonstrate, for the first time, that in low-resource languages, severe tokenization-induced date fragmentation disrupts temporal structure and predominantly drives performance degradation, whereas in high-resource languages, the linearity of temporal representations serves as the key predictor of model performance. These findings are corroborated through mixed-effects regression modeling, which confirms the differential dominance of tokenization versus representation depending on language resource availability.
📝 Abstract
We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks (date arithmetic, time zone conversion, and temporal relation extraction) across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains $15{,}000$ examples built by translating $750$ curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
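To make the fragmentation idea concrete, the following is a minimal, assumption-based sketch of what a date fragmentation ratio could measure: the number of tokens a tokenizer produces for a date string, divided by the number of semantic fields (year, month, day). This is only an illustration of the concept; it is not the paper's actual mDFR definition, and the toy tokenizers (`whole` and `char_level`) are hypothetical stand-ins for real subword tokenizers.

```python
from typing import Callable, List


def date_fragmentation_ratio(date_str: str,
                             tokenize: Callable[[str], List[str]]) -> float:
    """Illustrative fragmentation ratio (NOT the paper's mDFR):
    tokens produced for a date string divided by its number of
    semantic fields (year, month, day). A ratio of 1.0 means each
    field maps to exactly one token; higher values mean the date
    is split into many sub-token fragments."""
    fields = [f for f in date_str.replace("/", "-").split("-") if f]
    n_fields = max(len(fields), 1)
    return len(tokenize(date_str)) / n_fields


# Hypothetical tokenizers: one keeps each date field whole,
# one fragments the string into single characters.
whole = lambda s: [t for t in s.replace("/", "-").split("-") if t]
char_level = lambda s: list(s)

print(date_fragmentation_ratio("2024-05-17", whole))       # 1.0 (no fragmentation)
print(date_fragmentation_ratio("2024-05-17", char_level))  # ~3.33 (heavy fragmentation)
```

Under this toy measure, a tokenizer that preserves Year/Month/Day boundaries scores near 1.0, while one that shatters dates into digits scores much higher, mirroring the abstract's contrast between high-resource robustness and low-resource fragmentation.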