🤖 AI Summary
This study addresses the lack of systematic evaluation of large language models (LLMs) on temporal text classification (TTC) by presenting the first comprehensive comparison of leading closed-source models (e.g., GPT-4o, Claude 3.5, Gemini 1.5) and open-source models (e.g., LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) across three historical corpora under zero-shot, few-shot, and fine-tuned settings. The findings show that closed-source models achieve superior performance with few-shot prompting, while fine-tuning substantially improves open-source models' effectiveness, though they still lag behind their closed-source counterparts. This work underscores the critical influence of prompting strategy and model type on TTC performance and establishes a foundational benchmark, along with key insights, for advancing temporal understanding of textual data.
📝 Abstract
Languages change over time. Computational models can be trained to recognize such changes, enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on the automatic dating of texts, also known as Temporal Text Classification (TTC), has not been explored. This study provides the first systematic evaluation of leading proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora, two in English and one in Portuguese. We evaluate zero-shot prompting, few-shot prompting, and fine-tuning settings. Our results indicate that proprietary models perform well, especially with few-shot prompting, and that fine-tuning substantially improves open-source models, although they still fail to match the performance of proprietary LLMs.