Is Your LLM Outdated? A Deep Look at Temporal Generalization

📅 2024-05-14

📈 Citations: 0

✨ Influential: 0

career value

149K/year

🤖 AI Summary

This work addresses the lack of evaluation frameworks for assessing large language models’ (LLMs) generalization capability in time-dynamic scenarios, introducing the novel concept of “temporal generalization.” We propose FreshBench—the first dynamic benchmark ensuring no data leakage and no subjective bias. Methodologically, we design temporally controllable tasks grounded in fresh text sampling and event prediction, integrated with an automated evaluation pipeline to systematically assess models’ understanding of past and future events. Key findings include: (1) LLM performance degrades significantly over time, with stronger models exhibiting more pronounced deterioration in long-horizon future predictions; and (2) open-source models demonstrate superior temporal adaptability compared to closed-source counterparts. This work establishes a theoretical foundation for temporal robustness research and provides a reproducible, principled evaluation paradigm for time-aware model assessment.

Technology Category

Application Category

📝 Abstract

The rapid advancement of Large Language Models (LLMs) has led to the development of benchmarks that consider temporal dynamics, however, there remains a gap in understanding how well these models can generalize across temporal contexts due to the inherent dynamic nature of language and information. This paper introduces the concept of temporal generalization in LLMs, including bias in past and future generalizations. Then we introduce FreshBench, a new evaluation framework that employs fresh text and event prediction for assessing LLMs' temporal adaptability, ensuring the evaluation process free from data leakage and subjective bias. The experiment shows significant temporal biases and a decline in performance over time. Our findings reveal that powerful models, while initially superior, tend to decline more rapidly in future generalization. Additionally, powerful open-source models demonstrate better long-term adaptability compared to their closed-source counterparts. Our code is available at https://github.com/FreedomIntelligence/FreshBench.

Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' ability to generalize across temporal contexts

Evaluating temporal biases and performance decline over time

Comparing long-term adaptability of open-source vs closed-source LLMs

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces FreshBench for temporal adaptability evaluation

Assesses LLMs using fresh text and event prediction

Reveals temporal biases and performance decline trends

🔎 Similar Papers

Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time