🤖 AI Summary
Existing research lacks a systematic benchmark for evaluating large language models' (LLMs) ability to forecast AI-driven labor market shifts. To address this gap, we introduce the first time-series forecasting benchmark specifically designed for labor demand dynamics, integrating high-frequency U.S. job posting data with global occupational transition indicators and employing strict temporal splits to prevent lookahead bias. We propose a prompting paradigm that combines task structuring with role-based simulation, enabling systematic evaluation of diverse LLMs on short-term trend detection and long-term stability. Results show that structured prompting substantially improves prediction robustness, while role-based prompting excels at capturing short-term trends but exhibits notable variability across industries and time horizons. The benchmark is publicly released, establishing a reproducible and extensible infrastructure for rigorous, scalable research on AI's impact on employment.
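The strict temporal split can be made concrete with a short sketch. The following minimal Python illustration assumes a pandas DataFrame with a `date` column (the column name and schema are hypothetical; the benchmark's actual data layout is not shown here). The key property is that each task exposes only observations dated strictly before the cutoff, so nothing from the forecast window can leak into the prompt:

```python
import pandas as pd

def make_task(df: pd.DataFrame, cutoff: str, horizon_months: int):
    """Build one forecasting task with a strict temporal split.

    The model sees only rows dated strictly before `cutoff`; the
    target window starts at `cutoff`, so lookahead bias is excluded
    by construction.
    """
    cutoff_ts = pd.Timestamp(cutoff)
    history = df[df["date"] < cutoff_ts]  # context shown to the LLM
    target = df[
        (df["date"] >= cutoff_ts)
        & (df["date"] < cutoff_ts + pd.DateOffset(months=horizon_months))
    ]  # held-out forecast window, used only for scoring
    return history, target

# Hypothetical usage: forecast 3 months of postings from data up to 2023-01-01.
# history, target = make_task(postings_df, "2023-01-01", horizon_months=3)
```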
📝 Abstract
Artificial intelligence is reshaping labor markets, yet we lack tools to systematically forecast its effects on employment. This paper introduces a benchmark for evaluating how well large language models (LLMs) can anticipate changes in job demand, especially in occupations affected by AI. Existing research has shown that LLMs can extract sentiment, summarize economic reports, and emulate forecaster behavior, but little work has assessed their use for forward-looking labor prediction. Our benchmark combines two complementary datasets: a high-frequency index of sector-level job postings in the United States, and a global dataset of projected occupational changes due to AI adoption. We format these data into forecasting tasks with clear temporal splits, minimizing the risk of information leakage. We then evaluate LLMs using multiple prompting strategies, comparing task-scaffolded, persona-driven, and hybrid approaches across model families. We assess both quantitative accuracy and qualitative consistency over time. Results show that structured task prompts consistently improve forecast stability, while persona prompts offer advantages on short-term trends. However, performance varies significantly across sectors and horizons, highlighting the need for domain-aware prompting and rigorous evaluation protocols. By releasing our benchmark, we aim to support future research on labor forecasting, prompt design, and LLM-based economic reasoning. This work contributes to a growing body of research on how LLMs interact with real-world economic data, and provides a reproducible testbed for studying the limits and opportunities of AI as a forecasting tool in the context of labor markets.
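To illustrate how the three prompting strategies differ, here is a hedged Python sketch; the template wording, the `build_prompt` helper, and the output format are illustrative assumptions, not the benchmark's actual prompts:

```python
# Illustrative templates only; the paper's exact prompt wording is not reproduced.
TASK_SCAFFOLD = (
    "You are given a monthly job-posting index for one sector.\n"
    "Step 1: Describe the recent trend.\n"
    "Step 2: Note any seasonality.\n"
    "Step 3: Output the forecast for the next {horizon} months as a\n"
    "comma-separated list of numbers, and nothing else.\n\n"
    "Series: {series}"
)

PERSONA = (
    "You are a senior labor economist who tracks how AI adoption "
    "shifts hiring demand across sectors.\n\n"
)

def build_prompt(strategy: str, series: str, horizon: int) -> str:
    """Compose a prompt under one of the three evaluated strategies."""
    task = TASK_SCAFFOLD.format(series=series, horizon=horizon)
    if strategy == "task":      # task-scaffolded: explicit step-by-step structure
        return task
    if strategy == "persona":   # persona-driven: role framing, minimal structure
        return PERSONA + f"Forecast the next {horizon} months.\nSeries: {series}"
    if strategy == "hybrid":    # hybrid: role framing plus the task scaffold
        return PERSONA + task
    raise ValueError(f"unknown strategy: {strategy}")
```

Under this sketch, the same held-out series is scored across all three strategies and across model families, which is what allows the stability and short-term-trend comparisons reported above.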