🤖 AI Summary
This study addresses the use of the “simulated ignorance” (SI) method for retrospective evaluation of large language models’ (LLMs’) forecasting capabilities, questioning whether it reflects performance under true ignorance (TI). Through a systematic validation across 477 competition-level questions and nine models, the authors compare SI against TI, incorporating knowledge cutoff controls, chain-of-thought (CoT) analysis, and multi-model benchmarks. They find a substantial 52% performance gap between SI and TI, with CoT failing to mitigate interference from prior knowledge. Surprisingly, reasoning-optimized models exhibit even lower SI fidelity. These findings indicate that SI is unreliable for benchmarking forecasting capability and challenge the practice of retrospective LLM evaluation on pre-cutoff events.
📝 Abstract
Evaluating LLM forecasting capabilities is constrained by a fundamental tension: prospective evaluation offers methodological rigor but prohibitive latency, while retrospective forecasting (RF) -- evaluating on already-resolved events -- faces rapidly shrinking clean evaluation data as SOTA models possess increasingly recent knowledge cutoffs. Simulated Ignorance (SI), prompting models to suppress pre-cutoff knowledge, has emerged as a potential solution. We provide the first systematic test of whether SI can approximate True Ignorance (TI). Across 477 competition-level questions and 9 models, we find that SI fails systematically: (1) cutoff instructions leave a 52% performance gap between SI and TI; (2) chain-of-thought reasoning fails to suppress prior knowledge, even when reasoning traces contain no explicit post-cutoff references; (3) reasoning-optimized models exhibit worse SI fidelity despite superior reasoning trace quality. These findings demonstrate that prompts cannot reliably "rewind" model knowledge. We conclude that RF on pre-cutoff events is methodologically flawed; we recommend against using SI-based retrospective setups to benchmark forecasting capabilities.