🤖 AI Summary
Existing LLM reasoning benchmarks suffer from a polarization problem: informal datasets lack verifiability and are prone to bias, while formal systems (e.g., Lean) struggle to model realistic agent decision chains. To address this, we propose TempoBench—the first reasoning diagnostic benchmark that simultaneously ensures formal verifiability and real-world alignment. Our contributions are threefold: (1) a parameterized difficulty framework that decomposes reasoning capability along two orthogonal dimensions—temporal tracking and temporal causality; (2) the first reasoning decomposition framework integrating formal verification with task semantics; and (3) two novel quantitative metrics—Temporal Trace Evaluation (TTE) and Temporal Causal Evaluation (TCE)—to assess multi-step reasoning fidelity and causal structure extraction, respectively. Experiments reveal a stark performance gap: state-of-the-art models achieve 65.6% on TCE-normal but plummet to 7.5% on TCE-hard, exposing a critical bottleneck in modeling complex temporal systems.
📝 Abstract
Large Language Models (LLMs) increasingly excel at, and even outpace human performance on, many tasks. However, to improve LLM reasoning, researchers rely either on ad-hoc generated datasets or on formal mathematical proof systems such as the Lean proof assistant. While ad-hoc generated datasets can capture the decision chains of real-world reasoning processes, they may encode inadvertent bias in the space of reasoning they cover, and they cannot be formally verified. Systems like Lean, on the other hand, guarantee verifiability but are ill-suited to capturing the nature of agentic, decision-chain-based tasks. This creates a gap both in performance for applications such as business agents or code assistants, and in the usefulness of LLM reasoning benchmarks, which fall short in either reasoning structure or real-world alignment. We introduce TempoBench, the first formally grounded and verifiable diagnostic benchmark that parametrizes difficulty to systematically analyze how LLMs perform reasoning. TempoBench breaks down reasoning ability with two evaluations. First, temporal trace evaluation (TTE) tests an LLM's ability to understand and simulate the execution of a given multi-step reasoning system. Second, temporal causal evaluation (TCE) tests an LLM's ability to perform multi-step causal reasoning and to distill cause-and-effect relations from complex systems. We find that models score 65.6% on TCE-normal but only 7.5% on TCE-hard. This shows that state-of-the-art LLMs clearly understand the TCE task yet perform poorly as system complexity increases. Our code is available at our [GitHub repository](https://github.com/nik-hz/tempobench).
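To make the flavor of trace evaluation concrete, here is a minimal toy analogue of what a TTE-style task asks a model to do: simulate the step-by-step execution of a small labeled transition system and score a predicted trace against the ground-truth execution. The transition table, function names, and scoring rule are illustrative assumptions, not TempoBench's actual task format or metric.

```python
# Toy transition system: state -> {input_symbol: next_state}.
# This is a hypothetical example, not TempoBench's benchmark format.
TRANSITIONS = {
    "idle":    {"start": "running"},
    "running": {"pause": "paused", "stop": "idle"},
    "paused":  {"start": "running", "stop": "idle"},
}

def simulate(initial_state, inputs):
    """Execute the transition system, returning the sequence of visited states."""
    state, trace = initial_state, [initial_state]
    for symbol in inputs:
        state = TRANSITIONS[state][symbol]
        trace.append(state)
    return trace

def trace_score(predicted, ground_truth):
    """Fraction of steps where the predicted state matches the true one."""
    matches = sum(p == g for p, g in zip(predicted, ground_truth))
    return matches / len(ground_truth)

# Ground truth: idle -> running -> paused -> running -> idle
ground_truth = simulate("idle", ["start", "pause", "start", "stop"])

# A model's predicted trace that errs on the final step.
predicted = ["idle", "running", "paused", "running", "running"]
print(trace_score(predicted, ground_truth))  # 4 of 5 states correct -> 0.8
```

A TCE-style task would instead ask which inputs are causally responsible for reaching a given state, which grows much harder as the number of states and interacting inputs increases, consistent with the reported drop from TCE-normal to TCE-hard.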