WHODUNIT: Evaluation benchmark for culprit detection in mystery stories

📅 2025-02-11

🤖 AI Summary

This work evaluates the deductive reasoning of large language models (LLMs) at identifying the culprit in detective fiction. It introduces WhoDunIt, a benchmark built from open-domain mystery novels and short stories. To test robustness, stories are evaluated under character-level name augmentations, including original names, name swaps, and substitution with well-known real or fictional entities. Reliability is addressed by varying the prompting style and by selecting the majority response across multiple trials. Experiments on GPT-4o, GPT-4-turbo, and GPT-4o-mini show that the models perform reliably on unaltered texts but lose accuracy under certain name substitutions, particularly those involving widely recognized names, which suggests that the models are sensitive to associations carried by the names themselves. The dataset is publicly released to support reproducible evaluation of narrative reasoning in LLMs.

📝 Abstract

We present a novel dataset, WhoDunIt, to assess the deductive reasoning capabilities of large language models (LLMs) within narrative contexts. Constructed from open-domain mystery novels and short stories, the dataset challenges LLMs to identify the perpetrator after reading and comprehending the story. To evaluate model robustness, we apply a range of character-level name augmentations, including original names, name swaps, and substitutions with well-known real and/or fictional entities from popular discourse. We further use various prompting styles to investigate the influence of prompting on deductive reasoning accuracy. We conduct an evaluation study with state-of-the-art models, specifically GPT-4o, GPT-4-turbo, and GPT-4o-mini, each evaluated over multiple trials with majority response selection to ensure reliability. The results demonstrate that while LLMs perform reliably on unaltered texts, accuracy diminishes with certain name substitutions, particularly those with wide recognition. This dataset is publicly available here.
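The "multiple trials with majority response selection" step can be illustrated with a minimal Python sketch. The function name `majority_answer` and the normalization rule (trimming whitespace and lowercasing) are assumptions made for illustration; the abstract does not describe how responses are compared or how ties are broken.

```python
from collections import Counter


def majority_answer(responses):
    """Return the most frequent answer among several trial responses.

    Hypothetical helper: answers are compared after trimming whitespace
    and lowercasing, which is an assumption and not the paper's rule.
    """
    # Drop empty responses and normalize the rest for comparison.
    normalized = [r.strip().lower() for r in responses if r.strip()]
    if not normalized:
        return None
    # On a tie, Counter.most_common keeps the answer seen first.
    return Counter(normalized).most_common(1)[0][0]
```

For example, three trials answering "Colonel Mustard", "colonel mustard " and "Miss Scarlet" would resolve to the first name, since two of the three normalized responses agree.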

Problem

Research questions and friction points this paper is trying to address.

Evaluate LLM deductive reasoning in narratives

Assess robustness via character name augmentations

Investigate prompting styles' impact on accuracy

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset built from mystery novels and short stories for evaluating LLMs

Character-level name augmentations applied

Multiple prompting styles tested
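As a rough illustration of character-level name augmentation, the sketch below rewrites character names in a story according to a mapping. It is only a sketch under stated assumptions: the function `substitute_names`, the whole-word regex matching, and the example names are hypothetical and are not taken from the paper's pipeline.

```python
import re


def substitute_names(text, mapping):
    """Replace character names in `text` according to `mapping`.

    Hypothetical helper. All names are replaced in a single pass, so a
    swap such as {"Alice": "Bob", "Bob": "Alice"} does not undo itself.
    """
    if not mapping:
        return text
    # Try longer names first so a full name wins over its first name.
    names = sorted(mapping, key=len, reverse=True)
    # Match whole words only, so "Alice" does not match inside "Alicean".
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text)
```

A name swap is then `substitute_names(story, {"Alice": "Bob", "Bob": "Alice"})`, and a well-known-entity substitution uses the same call with a mapping from each original name to the chosen famous name. Names that end in punctuation, possessives, and nicknames would need extra handling that this sketch omits.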
