🤖 AI Summary
This paper examines two kinds of bias that affect temporal reasoning in large language models (LLMs): representation-level bias and logical-level bias. The authors propose a systematic diagnostic framework and introduce DateLogicQA, a fine-grained date-logic question-answering benchmark of 190 questions covering multiple date formats, temporal contexts, and reasoning types. They define the Semantic Integrity Metric to quantify how well tokenization preserves a date, which helps separate failures that originate in the representation layer from failures in the logical reasoning layer. Using embedding-space analysis, output consistency checks, and controlled reasoning-path evaluation, they assess mainstream LLMs. The experiments reveal weaknesses in date parsing, cross-format temporal alignment, and sequential logical deduction. The work offers an interpretable approach to evaluating and improving LLMs’ time-aware reasoning.
📝 Abstract
This paper introduces DateLogicQA, a benchmark with 190 questions covering diverse date formats, temporal contexts, and reasoning types. We propose the Semantic Integrity Metric to assess tokenization quality and analyse two biases: Representation-Level Bias, affecting embeddings, and Logical-Level Bias, influencing reasoning outputs. Our findings provide a comprehensive evaluation of LLMs' capabilities and limitations in temporal reasoning, highlighting key challenges in handling temporal data accurately. The GitHub repository for our work is available at https://github.com/gagan3012/EAIS-Temporal-Bias
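To make the idea of format-dependent tokenization concrete, the sketch below renders one date in several formats and checks whether a crude stand-in tokenizer keeps the year in one piece. Everything in it is an illustrative assumption: the format list, the fixed-width `toy_chunks` splitter, and the `year_kept_whole` check are hypothetical and are not the paper's benchmark formats, a real LLM tokenizer, or the Semantic Integrity Metric.

```python
from datetime import date

# Hypothetical set of date formats, chosen only for illustration;
# the benchmark's actual format list is not given in the abstract.
FORMATS = {
    "iso": "%Y-%m-%d",
    "us_slash": "%m/%d/%Y",
    "eu_slash": "%d/%m/%Y",
    "compact": "%Y%m%d",
}


def render_all(d: date) -> dict[str, str]:
    """Render the same calendar date in every format above."""
    return {name: d.strftime(fmt) for name, fmt in FORMATS.items()}


def toy_chunks(text: str, size: int = 4) -> list[str]:
    """Split text into fixed-width chunks.

    This is a stand-in for a subword tokenizer (an assumption),
    not how any real tokenizer segments text.
    """
    return [text[i:i + size] for i in range(0, len(text), size)]


def year_kept_whole(text: str, year: str, size: int = 4) -> bool:
    """Return True if the year appears as one intact chunk."""
    return year in toy_chunks(text, size)


if __name__ == "__main__":
    for name, text in render_all(date(2024, 3, 15)).items():
        print(name, text, toy_chunks(text), year_kept_whole(text, "2024"))
```

Even with this toy splitter, the same date is fragmented differently depending on its surface format, which is the kind of representation-level effect the abstract describes. Swapping `toy_chunks` for a real tokenizer would be the natural next step when reproducing the analysis.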