Around the World in 24 Hours: Probing LLM Knowledge of Time and Place

📅 2025-06-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work presents the first systematic evaluation of large language models’ (LLMs) capability in spatiotemporal joint reasoning. To this end, we introduce GeoTemp—a comprehensive benchmark covering 217 countries, 289 cities, and 37 time zones, comprising 320k samples—and propose the first evaluation framework explicitly designed for coupled spatiotemporal reasoning. We conduct zero-shot and few-shot evaluations across eight open-source dialogue models. Key findings are: (1) LLMs perform well on pure temporal reasoning but exhibit substantial performance degradation on tasks requiring spatiotemporal coupling; (2) token-level geographic name frequency (i.e., low perplexity) is a stronger predictor of model performance than regional geographic bias; (3) chain-of-thought prompting unexpectedly reduces accuracy on simple spatiotemporal tasks. These results indicate that advancing spatiotemporal reasoning requires not only scaling model size, but also improving geographic knowledge representation and optimizing training data distribution.

📝 Abstract
Reasoning over time and space is essential for understanding our world. However, the abilities of language models in this area are largely unexplored, as previous work has tested logical reasoning about time and space only in isolation, or only in simple or artificial environments. In this paper, we present the first evaluation of the ability of language models to jointly reason over time and space. To enable our analysis, we create GeoTemp, a dataset of 320k prompts covering 289 cities in 217 countries and 37 time zones. Using GeoTemp, we evaluate eight open chat models from three different model families on different combinations of temporal and geographic knowledge. We find that most models perform well on reasoning tasks involving only temporal knowledge and that overall performance improves with scale. However, performance remains constrained on tasks that require connecting temporal and geographical information. We do not find clear correlations of performance with specific geographic regions. Instead, we find a significant performance increase for location names with low model perplexity, suggesting their repeated occurrence during model training. We further demonstrate that performance is heavily influenced by prompt formulation: a direct injection of geographical knowledge leads to performance gains, whereas, surprisingly, techniques like chain-of-thought prompting decrease performance on simpler tasks.
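The coupled time-place task the abstract describes (given a wall-clock time in one city, infer the time in another) can be illustrated with a minimal sketch. The `CITY_TZ` mapping, helper name, and cities below are illustrative assumptions, not the paper's actual GeoTemp prompt format or city list; only Python's standard `zoneinfo` database is used to compute the ground-truth answer a model would be graded against.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Hypothetical city-to-IANA-timezone mapping (an assumption for this
# sketch; GeoTemp covers 289 cities across 37 time zones).
CITY_TZ = {
    "Tokyo": "Asia/Tokyo",
    "Berlin": "Europe/Berlin",
    "New York": "America/New_York",
}

def local_time_in(city_a: str, time_str: str, city_b: str,
                  date: str = "2025-06-04") -> str:
    """Ground truth for prompts like: 'If it is 09:00 in Tokyo,
    what time is it in Berlin?' Converts a wall-clock time in
    city_a to the corresponding local time in city_b."""
    t = datetime.fromisoformat(f"{date}T{time_str}")
    t = t.replace(tzinfo=ZoneInfo(CITY_TZ[city_a]))
    return t.astimezone(ZoneInfo(CITY_TZ[city_b])).strftime("%H:%M")

# Tokyo is UTC+9; Berlin observes CEST (UTC+2) in June.
print(local_time_in("Tokyo", "09:00", "Berlin"))  # → 02:00
```

Note that the date matters: daylight-saving rules make the offset between two cities time-dependent, which is part of what makes this coupled reasoning harder than pure clock arithmetic.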
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' joint reasoning over time and space
Assessing performance on temporal-geographic knowledge tasks
Analyzing impact of prompt design on reasoning accuracy
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created GeoTemp dataset for evaluation
Evaluated models on temporal and geographic reasoning
Analyzed impact of prompt formulation on performance