🤖 AI Summary
Large language models (LLMs) suffer from pervasive temporal information leakage in ex-ante reasoning—frequently invoking post-deadline knowledge despite explicit temporal cutoff prompts, thereby distorting predictions. To address this, we introduce the first multi-domain benchmark specifically designed for ex-ante reasoning, covering stock prediction, Wikipedia events, scientific publication forecasting, and temporal question answering. We formally define and quantify “temporal information leakage rate” (TILR), revealing leakage rates of 30–70% across mainstream models. Our temporal-aware evaluation framework comprises three key components: (1) timestamp-annotated data construction, (2) cross-domain knowledge isolation to prevent spillover, and (3) multi-task prompt robustness testing under strict temporal constraints. This benchmark fills a critical gap in evaluating LLMs’ temporal consistency and reasoning fidelity under time-bound conditions. It provides a reproducible, quantifiable assessment standard for time-sensitive applications—including financial forecasting, policy simulation, and historical counterfactual analysis—enabling rigorous scrutiny of models’ adherence to temporal boundaries.
📝 Abstract
Large language models (LLMs) face significant challenges in ex-ante reasoning, where analysis, inference, or prediction must be made without access to information about future events. Even with explicit prompts enforcing temporal cutoffs, LLMs often generate outputs influenced by internalized knowledge of events beyond the specified cutoff. This paper introduces a novel task and benchmark designed to evaluate the ability of LLMs to reason while adhering to such temporal constraints. The benchmark includes a variety of tasks: stock prediction, Wikipedia event prediction, scientific publication prediction, and question answering (QA), each designed to assess factual knowledge under temporal cutoff constraints. We use a leakage rate metric to quantify models' reliance on information from beyond the cutoff timestamps. Experimental results reveal that LLMs fail to consistently adhere to temporal cutoffs across common prompting strategies and tasks, demonstrating persistent challenges in ex-ante reasoning. This benchmark provides a potential evaluation framework to advance the development of LLMs' temporal reasoning ability for time-sensitive applications.
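At its core, the leakage rate described above reduces to the fraction of model responses judged to invoke information dated after the prompt's temporal cutoff. The sketch below is illustrative only: the data structure, field names, and judging step are our own assumptions, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Response:
    """One evaluated model response (hypothetical schema).

    cutoff: the temporal cutoff stated in the prompt, e.g. "2021-06-30"
    leaked: whether a judge flagged the response as relying on
            post-cutoff information
    """
    cutoff: str
    leaked: bool

def leakage_rate(responses: list[Response]) -> float:
    """Fraction of responses that rely on post-cutoff information."""
    if not responses:
        return 0.0
    return sum(r.leaked for r in responses) / len(responses)

# Toy usage: 2 of 4 responses were flagged as leaking.
batch = [
    Response("2021-06-30", leaked=True),
    Response("2021-06-30", leaked=False),
    Response("2022-01-01", leaked=True),
    Response("2022-01-01", leaked=False),
]
print(leakage_rate(batch))  # 0.5
```

In practice the hard part is the `leaked` judgment itself, which the paper's framework must derive from timestamp-annotated data; the arithmetic above only aggregates those per-response judgments into the reported percentages.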