DetectiveQA: Evaluating Long-Context Reasoning on Detective Novels

📅 2024-09-04
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the limitations of large language models (LLMs) in evidence localization and multi-hop logical reasoning over long narrative texts—particularly detective fiction. To this end, we introduce DetectiveQA, the first bilingual, long-context evaluation benchmark specifically designed for narrative reasoning. Built upon original detective stories exceeding 100K tokens, it comprises 1,200 Chinese–English question-answer pairs with manually annotated step-by-step reasoning chains. We propose a novel step-level reasoning consistency metric, pioneering the use of classic detective fiction as an evaluation framework that simultaneously ensures realism, structural complexity, and interpretability. Comprehensive evaluation across state-of-the-art models—including GPT-4, Claude, and LLaMA—reveals significant bottlenecks in long-context evidence retrieval and chained logical inference. DetectiveQA thus establishes a new, rigorous, and reproducible benchmark for assessing and diagnosing long-context reasoning capabilities in LLMs.

📝 Abstract
Recently, significant efforts have been devoted to enhancing the long-context capabilities of Large Language Models (LLMs), particularly in long-context reasoning. To facilitate this research, we propose DetectiveQA, a dataset specifically designed for narrative reasoning within long contexts. We leverage detective novels, averaging over 100k tokens, to create a dataset containing 1,200 human-annotated questions in both Chinese and English, each paired with corresponding reference reasoning steps. Furthermore, we introduce a step-wise reasoning metric, which enhances the evaluation of LLMs' reasoning processes. We validate our approach and evaluate mainstream LLMs, including GPT-4, Claude, and LLaMA, revealing persistent long-context reasoning challenges and demonstrating their difficulties with evidence retrieval. Our findings offer valuable insights into the study of long-context reasoning and lay the groundwork for more rigorous evaluations.
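The abstract describes a step-wise reasoning metric that scores a model's reasoning process against human-annotated reference steps. The paper does not spell out its matching rule here, so the sketch below is a hedged illustration only: it assumes each annotated reference step counts as "covered" if some model step exceeds a token-overlap F1 threshold. The function names, the F1 criterion, and the threshold are all illustrative assumptions, not the paper's exact definition.

```python
# Illustrative sketch of a step-wise reasoning consistency score.
# Assumption: a reference step is matched when some model step has
# token-overlap F1 >= threshold; the real metric may differ.

def token_f1(pred: str, ref: str) -> float:
    """Token-overlap F1 between two reasoning steps (illustrative)."""
    p, r = pred.lower().split(), ref.lower().split()
    common = len(set(p) & set(r))
    if common == 0:
        return 0.0
    prec, rec = common / len(p), common / len(r)
    return 2 * prec * rec / (prec + rec)

def stepwise_consistency(model_steps, reference_steps, threshold=0.5):
    """Fraction of annotated reference steps matched by some model step."""
    if not reference_steps:
        return 0.0
    matched = sum(
        1 for ref in reference_steps
        if any(token_f1(pred, ref) >= threshold for pred in model_steps)
    )
    return matched / len(reference_steps)
```

A step-level score like this rewards models whose intermediate deductions align with the annotated chain, rather than only checking the final answer.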
Problem

Research questions and friction points this paper is trying to address.

Evaluating long-context reasoning in Large Language Models (LLMs).
Creating a dataset (DetectiveQA) for narrative reasoning in long contexts.
Introducing a step-wise metric to assess LLMs' reasoning processes.
Innovation

Methods, ideas, or system contributions that make the work stand out.

DetectiveQA dataset for narrative reasoning
Step-wise reasoning metric for LLM evaluation
Validation of LLMs on long-context challenges
Zhe Xu
School of Computer Science, Fudan University
Jiasheng Ye
Fudan University
Xiangyang Liu
School of Computer Science, Fudan University
Tianxiang Sun
School of Computer Science, Fudan University
Xiaoran Liu
Fudan University
Qipeng Guo
Fudan University
Linlin Li
Huawei Noah’s Ark Lab
Qun Liu
Huawei Noah’s Ark Lab
Xuanjing Huang
School of Computer Science, Fudan University
Xipeng Qiu
School of Computer Science, Fudan University, Shanghai AI Laboratory