OracleProto: A Reproducible Framework for Benchmarking LLM Native Forecasting via Knowledge Cutoff and Temporal Masking

📅 2026-05-05
🤖 AI Summary
This work addresses the difficulty of evaluating the forecasting ability of large language models: live assessments are not reproducible once events resolve, and retrospective evaluations risk contamination from facts the model learned during pretraining. The authors propose a reproducible evaluation framework that enforces strict information boundaries through knowledge cutoff alignment, tool-level temporal masking, content-level leakage detection, discrete answer normalization, and hierarchical scoring. By reconstructing resolved historical events as time-bounded prediction tasks, the framework reduces residual leakage to roughly the 1% level, an order of magnitude below tool-only temporal filtering. An evaluation of six contemporary large language models on a dataset derived from FutureX-Past shows that the framework supports fair cross-model comparison and provides a controlled signal source for downstream SFT and RL.
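To make the idea of content-level leakage detection more concrete, here is a minimal sketch, not the authors' detector. It assumes a simple heuristic: a retrieved snippet is flagged when it contains an explicit ISO-formatted date later than the information boundary. The names `mentions_date_after` and `leakage_rate` are hypothetical and do not come from the OracleProto code base.

```python
import re
from datetime import date

# Matches ISO-style dates such as 2024-07-02 (assumed format for this sketch).
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")


def mentions_date_after(text: str, boundary: date) -> bool:
    """Return True if the text contains a valid ISO date later than the boundary."""
    for year, month, day in ISO_DATE.findall(text):
        try:
            found = date(int(year), int(month), int(day))
        except ValueError:
            # Skip strings that look like dates but are not valid (e.g. 2024-13-45).
            continue
        if found > boundary:
            return True
    return False


def leakage_rate(snippets: list[str], boundary: date) -> float:
    """Fraction of snippets flagged as leaking post-boundary information."""
    if not snippets:
        return 0.0
    flagged = sum(mentions_date_after(s, boundary) for s in snippets)
    return flagged / len(snippets)
```

A real detector would likely need to catch leakage that carries no explicit date, so this heuristic should be read only as an illustration of where a content-level check sits relative to tool-level filtering.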
📝 Abstract
Large language models are moving from static text generators toward real-world decision-support systems, where forecasting is a composite capability that links information gathering, evidence integration, situational judgment, and action-oriented decision making. This capability is in broad demand across finance, policy, industry, and scientific research, yet its evaluation remains difficult: live benchmarks evaluate forecasts before answers exist, making them the cleanest way to measure forecasting ability, but they expire once events resolve; retrospective benchmarks are reproducible, but they cannot reliably distinguish genuine forecasting from facts a model may have already learned during pretraining. Prompting models to "pretend not to know" cannot replace a genuine knowledge boundary. We propose OracleProto, a reproducible framework for evaluating LLM native forecasting capability. OracleProto reconstructs resolved events into time-bounded forecasting samples by combining model-cutoff-aligned sample admission, tool-level temporal masking, content-level leakage detection, discrete answer normalization, and hierarchical scoring. Instantiated on a FutureX-Past-derived dataset with six contemporary LLMs, OracleProto distinguishes forecasting quality, sampling stability, and cost efficiency under controlled information boundaries, while reducing residual leakage to the $1\%$ level, an order of magnitude below tool-only temporal filtering. OracleProto turns LLM forecasting from one-off evaluation into an auditable, reusable, and trainable dataset-level capability, providing a unified interface for fair cross-model comparison and a controlled signal source for downstream SFT and RL. Code and data are available at https://github.com/MaYiding/OracleProto and https://huggingface.co/datasets/MaYiding/OracleProto.
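The first two mechanisms named in the abstract, model-cutoff-aligned sample admission and tool-level temporal masking, can be illustrated with a short sketch. This is an assumption-laden illustration rather than the released implementation: the types `ForecastSample` and `SearchResult`, the functions `admit_samples` and `mask_results`, and the rule that a sample is admitted only when its event resolved after the model's knowledge cutoff are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ForecastSample:
    question: str
    resolved_on: date  # date the real-world event resolved (hypothetical field)


@dataclass(frozen=True)
class SearchResult:
    url: str
    published_on: date  # publication date reported by the tool (hypothetical field)


def admit_samples(samples: list[ForecastSample], model_cutoff: date) -> list[ForecastSample]:
    """Keep only samples whose outcome resolved after the model's knowledge cutoff."""
    return [s for s in samples if s.resolved_on > model_cutoff]


def mask_results(results: list[SearchResult], boundary: date) -> list[SearchResult]:
    """Drop tool results published after the information boundary."""
    return [r for r in results if r.published_on <= boundary]
```

Under these assumptions, a model with a cutoff of 2024-06-01 would be scored only on events that resolved after that date, and its search tool would return only documents published on or before the chosen boundary. The abstract reports that tool-only filtering of this kind leaves about ten times more residual leakage than the full pipeline, which is why a content-level check follows it.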
Problem

Research questions and friction points this paper is trying to address.

LLM forecasting
knowledge cutoff
temporal masking
benchmark reproducibility
information leakage
Innovation

Methods, ideas, or system contributions that make the work stand out.

temporal masking
knowledge cutoff
forecasting benchmark
information leakage detection
reproducible evaluation