DateLogicQA: Benchmarking Temporal Biases in Large Language Models

📅 2024-12-17

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses two kinds of bias in how large language models (LLMs) handle temporal reasoning: Representation-Level Bias and Logical-Level Bias. The authors propose a systematic diagnostic framework built around DateLogicQA, a fine-grained date-logic question-answering benchmark of 190 questions covering multiple date formats, temporal contexts, and reasoning types. They define the Semantic Integrity Metric to quantify tokenization-induced bias, and they separate failures that originate in the representation layer from those that originate in logical reasoning. Using embedding-space analysis, output consistency checks, and controlled reasoning-path evaluation, they assess mainstream LLMs. Experiments reveal weaknesses in date parsing, cross-format temporal alignment, and sequential logical deduction. The work offers an interpretable approach to evaluating and improving LLMs' time-aware reasoning.

📝 Abstract

This paper introduces DateLogicQA, a benchmark with 190 questions covering diverse date formats, temporal contexts, and reasoning types. We propose the Semantic Integrity Metric to assess tokenization quality and analyse two biases: Representation-Level Bias, affecting embeddings, and Logical-Level Bias, influencing reasoning outputs. Our findings provide a comprehensive evaluation of LLMs' capabilities and limitations in temporal reasoning, highlighting key challenges in handling temporal data accurately. The GitHub repository for our work is available at https://github.com/gagan3012/EAIS-Temporal-Bias
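
To make the idea of format-dependent tokenization concrete, the following is a minimal sketch, not the authors' code. It renders one date in several formats and scores how many date components (year, month, day) survive intact inside a single token. The format set, the function names, and the scoring rule are all assumptions made for illustration; in particular, the score is a toy proxy and is not the paper's Semantic Integrity Metric, whose definition is not given in this summary.

```python
from datetime import date

# Hypothetical format set for illustration; not the benchmark's actual list.
DATE_FORMATS = {
    "iso": "%Y-%m-%d",
    "us_slash": "%m/%d/%Y",
    "eu_slash": "%d/%m/%Y",
    "compact": "%d%m%Y",
}


def render_formats(d: date) -> dict[str, str]:
    """Render one calendar date in every format in DATE_FORMATS."""
    return {name: d.strftime(fmt) for name, fmt in DATE_FORMATS.items()}


def components_intact(tokens: list[str], components: list[str]) -> float:
    """Fraction of date components found whole inside a single token.

    Toy proxy only; NOT the paper's Semantic Integrity Metric.
    """
    if not components:
        return 1.0
    kept = sum(any(c in tok for tok in tokens) for c in components)
    return kept / len(components)
```

The token list can come from any tokenizer. For example, a tokenizer that splits "2024-12-17" into "20", "24", "-", "12", "-", "17" breaks the year across two tokens, so only two of the three components remain intact under this toy score.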

Problem

Research questions and friction points this paper is trying to address.

Evaluating temporal biases in LLMs using DateLogicQA benchmark

Assessing tokenization quality with Semantic Integrity Metric

Analyzing Representation-Level and Logical-Level temporal biases

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces DateLogicQA benchmark for temporal reasoning

Proposes Semantic Integrity Metric for tokenization

Analyzes Representation and Logical-Level Biases

🔎 Similar Papers

Time Awareness in Large Language Models: Benchmarking Fact Recall Across Time