DateLogicQA: Benchmarking Temporal Biases in Large Language Models

📅 2024-12-17
🏛️ arXiv.org
🤖 AI Summary
This paper examines two kinds of bias that affect temporal reasoning in large language models (LLMs): representation-level bias and logical-level bias. The authors introduce DateLogicQA, a date-logic question-answering benchmark of 190 questions covering multiple date formats, temporal contexts, and reasoning types. They propose the Semantic Integrity Metric to quantify tokenization quality for dates, and they separate failures that originate in the representation layer from those that originate in logical reasoning. Using embedding-space analysis, output consistency checks, and controlled reasoning-path evaluation, they assess mainstream LLMs. The experiments reveal weaknesses in date parsing, cross-format temporal alignment, and sequential logical deduction. The work offers an interpretable framework for evaluating and improving LLMs' time-aware reasoning.

📝 Abstract
This paper introduces DateLogicQA, a benchmark with 190 questions covering diverse date formats, temporal contexts, and reasoning types. We propose the Semantic Integrity Metric to assess tokenization quality and analyse two biases: Representation-Level Bias, affecting embeddings, and Logical-Level Bias, influencing reasoning outputs. Our findings provide a comprehensive evaluation of LLMs' capabilities and limitations in temporal reasoning, highlighting key challenges in handling temporal data accurately. The GitHub repository for our work is available at https://github.com/gagan3012/EAIS-Temporal-Bias
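To make the "diverse date formats" idea concrete, the following minimal Python sketch renders one calendar date in several surface forms and shows why cross-format alignment is hard: the same string can denote different dates under different conventions. The format list and the helper names `render_all` and `parse_as` are assumptions chosen for illustration; they are not taken from the benchmark or its repository.

```python
from datetime import date, datetime

# Illustrative format strings only (an assumption, not the benchmark's actual list).
FORMATS = {
    "iso": "%Y-%m-%d",
    "dmy_slash": "%d/%m/%Y",
    "mdy_slash": "%m/%d/%Y",
    "compact": "%Y%m%d",
}


def render_all(d: date) -> dict[str, str]:
    """Render one date in every illustrative format."""
    return {name: d.strftime(fmt) for name, fmt in FORMATS.items()}


def parse_as(text: str, format_name: str) -> date:
    """Parse a date string under one explicitly chosen format."""
    return datetime.strptime(text, FORMATS[format_name]).date()


reference = date(2023, 2, 1)
variants = render_all(reference)

# Every variant maps back to the same date when its format is known.
assert all(parse_as(text, name) == reference for name, text in variants.items())

# Without the format, a string such as "01/02/2023" is ambiguous.
ambiguous = "01/02/2023"
as_dmy = parse_as(ambiguous, "dmy_slash")  # day first: 1 February 2023
as_mdy = parse_as(ambiguous, "mdy_slash")  # month first: 2 January 2023
print(variants)
print(as_dmy.isoformat(), as_mdy.isoformat())
```

A question set built this way can hold the underlying date fixed while varying only the surface form, which is one plausible way to separate format handling from the reasoning step itself.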
Problem

Research questions and friction points this paper is trying to address.

Evaluating temporal biases in LLMs using the DateLogicQA benchmark
Assessing tokenization quality with the Semantic Integrity Metric
Analyzing Representation-Level and Logical-Level temporal biases
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the DateLogicQA benchmark for temporal reasoning
Proposes the Semantic Integrity Metric for assessing tokenization quality
Analyzes Representation-Level and Logical-Level Biases