🤖 AI Summary
This work addresses the limitations of large language models (LLMs) in evidence localization and multi-hop logical reasoning over long narrative texts, particularly detective fiction. To this end, the authors introduce DetectiveQA, the first bilingual, long-context evaluation benchmark designed specifically for narrative reasoning. Built on original detective stories exceeding 100K tokens, it comprises 1,200 Chinese and English question-answer pairs with manually annotated step-by-step reasoning chains. The authors also propose a step-wise reasoning metric and pioneer the use of classic detective fiction as an evaluation setting that combines realism, structural complexity, and interpretability. A comprehensive evaluation of mainstream models, including GPT-4, Claude, and LLaMA, reveals significant bottlenecks in long-context evidence retrieval and chained logical inference. DetectiveQA thus establishes a rigorous, reproducible benchmark for assessing and diagnosing long-context reasoning capabilities in LLMs.
📝 Abstract
Recently, significant efforts have been devoted to enhancing the long-context capabilities of Large Language Models (LLMs), particularly in long-context reasoning. To facilitate this research, we propose **DetectiveQA**, a dataset specifically designed for narrative reasoning within long contexts. We leverage detective novels averaging over 100k tokens to create a dataset of 1,200 human-annotated questions, available in both Chinese and English, each paired with corresponding reference reasoning steps. Furthermore, we introduce a step-wise reasoning metric that enables finer-grained evaluation of LLMs' reasoning processes. We validate our approach and evaluate mainstream LLMs, including GPT-4, Claude, and LLaMA, revealing persistent challenges in long-context reasoning and evidence retrieval. Our findings offer valuable insights into the study of long-context reasoning and lay the foundation for more rigorous evaluations.
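
To make the idea of a step-wise reasoning metric concrete, here is a minimal sketch of how model-generated reasoning steps might be scored against the dataset's annotated reference steps. The function names, the token-overlap matcher, and the 0.5 threshold are illustrative assumptions rather than the paper's actual implementation, which may rely on a different matching criterion (e.g., an LLM judge or embedding similarity).

```python
from typing import List


def step_match(model_step: str, reference_step: str) -> bool:
    """Illustrative matcher: a reference step counts as covered if enough of
    its tokens appear in the model's step. A real evaluation could instead
    use an LLM judge or embedding similarity."""
    ref_tokens = set(reference_step.lower().split())
    model_tokens = set(model_step.lower().split())
    if not ref_tokens:
        return False
    overlap = len(ref_tokens & model_tokens) / len(ref_tokens)
    return overlap >= 0.5  # threshold chosen only for illustration


def stepwise_reasoning_score(model_steps: List[str],
                             reference_steps: List[str]) -> float:
    """Fraction of annotated reference reasoning steps covered by the model's
    reasoning chain (recall over reference steps)."""
    if not reference_steps:
        return 0.0
    covered = sum(
        any(step_match(m, r) for m in model_steps) for r in reference_steps
    )
    return covered / len(reference_steps)


# Toy usage with made-up reasoning steps
reference = [
    "The letter was postmarked before the victim left the house.",
    "Only the gardener had access to the greenhouse key.",
]
model_output = [
    "The postmark on the letter predates the victim leaving the house.",
    "Therefore the gardener, who held the greenhouse key, is the culprit.",
]
print(f"step-wise score: {stepwise_reasoning_score(model_output, reference):.2f}")
```

Scoring at the level of individual reasoning steps, rather than only the final answer, is what lets the benchmark diagnose where a long-context reasoning chain breaks down.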