AI Summary
This study addresses parametric look-ahead bias in backtesting large language models (LLMs) on historical financial data: the models' pretraining corpora already contain the outcomes being "predicted", which inflates measured performance. To mitigate this without retraining, the authors propose FinCAD, a context-aware decoding framework that suppresses the model's reliance on memorized historical financial outcomes at inference time. The method combines an adversarial bias-discovery pipeline, which learns a model-specific memory-activating prompt, with an entity- and date-adaptive rule that calibrates the suppression strength to how strongly each (entity, date) pair is memorized. Experiments across five 7-14B parameter LLMs and five mega-cap equities show that FinCAD reduces in-sample backtested returns by up to 67.1% on memorized dates, while 2025 out-of-sample returns stay within $8K and Sharpe ratios within 0.10 of baseline. General-purpose reasoning is preserved within 1.7 points, and on an eleven-model leaderboard the Spearman correlation between in-sample and out-of-sample performance rises from +0.779 to +0.846.
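The leaderboard metric mentioned above is a rank correlation: it asks whether the ordering of models by in-sample return matches their ordering by out-of-sample return. A minimal sketch of how such a Spearman coefficient can be computed follows. The function names and the toy returns are hypothetical illustrations, not the paper's code or data.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Plain Pearson correlation; raises if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation = Pearson correlation of the rank vectors."""
    if len(x) != len(y):
        raise ValueError("inputs must have the same length")
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy returns for four models (NOT figures from the paper).
in_sample = [120.0, 95.0, 60.0, 30.0]
out_of_sample = [15.0, 18.0, 9.0, 4.0]
print(spearman(in_sample, out_of_sample))
```

A value near +1 means the in-sample ranking is a reliable guide to out-of-sample ranking, which is the property the leaderboard comparison is meant to test.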
Abstract
Backtesting large language models (LLMs) on historical financial data is unreliable because the models' pre-training cutoff postdates the events being tested. An LLM trained in 2024 already "knows" which way stocks moved in 2018-2020. We name this failure parametric look-ahead bias and propose FinCAD, an inference-time adaptation of Context-Aware Decoding (CAD) that suppresses an LLM's memory of historical outcomes without retraining. FinCAD pairs an adversarial bias-discovery pipeline, which learns a model-specific memory-activating prior prompt, with an entity- and date-adaptive rule that scales the CAD strength to per-(entity, date) memorisation, so the penalty fires on memorised in-sample dates and decays to zero out-of-sample. Across five 7-14B LLMs and five mega-cap equities, FinCAD cuts in-sample backtest returns by up to 67.1% on memorised dates while leaving 2025 out-of-sample returns within $8K and Sharpe within 0.10 of baseline, and preserves general-purpose reasoning within 1.7 pts. On an eleven-model leaderboard, it raises the in-sample / out-of-sample Spearman correlation from +0.779 to +0.846, recovering rankings that genuinely predict out-of-sample performance.
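The decoding step described above can be illustrated with a minimal sketch. It uses the standard Context-Aware Decoding contrast, (1 + α) · logits_context − α · logits_prior, and assumes that the distribution elicited by the memory-activating prior prompt plays the role of the subtracted term. The abstract does not give FinCAD's exact formula, so the linear scaling rule, the memorisation score in [0, 1], and the names `adaptive_alpha`, `cad_adjust`, and `alpha_max` are all hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a small vocabulary."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def adaptive_alpha(memorisation: float, alpha_max: float = 1.0) -> float:
    """Hypothetical rule: CAD strength grows linearly with a memorisation
    score clipped to [0, 1], so it is zero for unmemorised (entity, date) pairs."""
    return alpha_max * min(1.0, max(0.0, memorisation))


def cad_adjust(
    context_logits: Sequence[float],
    prior_logits: Sequence[float],
    alpha: float,
) -> List[float]:
    """Apply the CAD contrast and return normalised log-probabilities.

    context_logits: next-token logits given the backtest context.
    prior_logits:   logits under the memory-activating prior prompt (assumed role).
    """
    if len(context_logits) != len(prior_logits):
        raise ValueError("logit vectors must have the same length")
    mixed = [(1 + alpha) * c - alpha * p
             for c, p in zip(context_logits, prior_logits)]
    return log_softmax(mixed)


# Toy three-action vocabulary and made-up logits, for illustration only.
actions = ["BUY", "HOLD", "SELL"]
context_logits = [1.0, 0.5, 0.2]   # model reading the market context
prior_logits = [3.0, 0.0, 0.0]     # model "remembering" the stock went up

for score in (0.0, 1.0):           # out-of-sample vs. heavily memorised date
    alpha = adaptive_alpha(score)
    adjusted = cad_adjust(context_logits, prior_logits, alpha)
    choice = actions[adjusted.index(max(adjusted))]
    print(f"memorisation={score:.1f} alpha={alpha:.1f} choice={choice}")
```

With a memorisation score of zero the adjustment is a no-op and the model's ordinary distribution is returned, which mirrors the abstract's claim that the penalty decays to zero out-of-sample. With a high score, tokens favoured mainly by the memory-activating prompt are penalised.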