🤖 AI Summary
This work addresses the lack of evaluation frameworks for assessing large language models’ (LLMs) generalization capability in time-dynamic scenarios, introducing the novel concept of “temporal generalization.” We propose FreshBench—the first dynamic benchmark ensuring no data leakage and no subjective bias. Methodologically, we design temporally controllable tasks grounded in fresh text sampling and event prediction, integrated with an automated evaluation pipeline to systematically assess models’ understanding of past and future events. Key findings include: (1) LLM performance degrades significantly over time, with stronger models exhibiting more pronounced deterioration in long-horizon future predictions; and (2) open-source models demonstrate superior temporal adaptability compared to closed-source counterparts. This work establishes a theoretical foundation for temporal robustness research and provides a reproducible, principled evaluation paradigm for time-aware model assessment.
📝 Abstract
The rapid advancement of Large Language Models (LLMs) has led to the development of benchmarks that consider temporal dynamics, however, there remains a gap in understanding how well these models can generalize across temporal contexts due to the inherent dynamic nature of language and information. This paper introduces the concept of temporal generalization in LLMs, including bias in past and future generalizations. Then we introduce FreshBench, a new evaluation framework that employs fresh text and event prediction for assessing LLMs' temporal adaptability, ensuring the evaluation process free from data leakage and subjective bias. The experiment shows significant temporal biases and a decline in performance over time. Our findings reveal that powerful models, while initially superior, tend to decline more rapidly in future generalization. Additionally, powerful open-source models demonstrate better long-term adaptability compared to their closed-source counterparts. Our code is available at https://github.com/FreedomIntelligence/FreshBench.