🤖 AI Summary
This work addresses temporal contamination in backtesting large language models (LLMs): retrospective predictions can be distorted by the model implicitly relying on knowledge acquired after its training cutoff date. The authors first propose a claim-level framework that decomposes model rationales into atomic claims, categorizes them by temporal verifiability, and introduces Shapley-DCLR, an interpretable metric that quantifies the fraction of decision-driving reasoning derived from leaked information. Building on this framework, TimeSPEC interleaves generation with claim verification and regeneration to actively filter contaminated claims. Experiments across three forecasting tasks (Supreme Court rulings, NBA player salaries, and stock returns) reveal substantial temporal leakage under standard prompting, whereas TimeSPEC reduces Shapley-DCLR scores while preserving predictive performance, substantially improving the reliability of backtesting evaluations.
📝 Abstract
To evaluate whether LLMs can accurately predict future events, we need the ability to *backtest* them on events that have already resolved. This requires models to reason only with information available at a specified past date. Yet LLMs may inadvertently leak post-cutoff knowledge encoded during training, undermining the validity of retrospective evaluation. We introduce a claim-level framework for detecting and quantifying this *temporal knowledge leakage*. Our approach decomposes model rationales into atomic claims and categorizes them by temporal verifiability, then applies *Shapley values* to measure each claim's contribution to the prediction. This yields the **Shapley**-weighted **D**ecision-**C**ritical **L**eakage **R**ate (**Shapley-DCLR**), an interpretable metric that captures what fraction of decision-driving reasoning derives from leaked information. Building on this framework, we propose **Time**-**S**upervised **P**rediction with **E**xtracted **C**laims (**TimeSPEC**), which interleaves generation with claim verification and regeneration to proactively filter temporal contamination, producing predictions where every supporting claim can be traced to sources available before the cutoff date. Experiments on 350 instances spanning U.S. Supreme Court case prediction, NBA salary estimation, and stock return ranking reveal substantial leakage in standard prompting baselines. TimeSPEC reduces Shapley-DCLR while preserving task performance, demonstrating that explicit, interpretable claim-level verification outperforms prompt-based temporal constraints for reliable backtesting.
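The abstract's Shapley-DCLR idea can be illustrated with a minimal sketch: compute exact Shapley values over a small set of atomic claims, then report the share of (positive) attribution carried by claims flagged as leaked. The `value_fn` here is a hypothetical stand-in for scoring the model's prediction given a subset of claims; the paper's actual scoring and leakage-labeling procedures are not reproduced.

```python
from itertools import combinations
from math import factorial

def shapley_values(claims, value_fn):
    """Exact Shapley values: each claim's average marginal contribution
    to the prediction score over all subsets (feasible for small n)."""
    n = len(claims)
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for subset in combinations(others, k):
                # Standard Shapley weight for a coalition of size k.
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                with_i = value_fn([claims[j] for j in sorted(subset + (i,))])
                without_i = value_fn([claims[j] for j in subset])
                phi[i] += weight * (with_i - without_i)
    return phi

def shapley_dclr(claims, leaked, value_fn):
    """Shapley-weighted leakage rate (sketch): fraction of total positive
    attribution assigned to claims labeled as temporally leaked."""
    phi = shapley_values(claims, value_fn)
    pos = [max(p, 0.0) for p in phi]
    total = sum(pos) or 1.0
    return sum(p for p, c in zip(pos, claims) if c in leaked) / total
```

For an additive `value_fn`, each claim's Shapley value equals its own weight, which makes the metric easy to sanity-check on toy inputs; real rationales would require sampling-based approximation rather than the exponential enumeration above.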