🤖 AI Summary
Creative writing lacks reliable, automated evaluation standards. Method: We introduce LitBench, the first standardized benchmark for literary generation, featuring a large-scale, human-annotated dataset of pairwise story comparisons. Rather than relying on zero-shot LLM judges, we construct debiased paired data and train dedicated reward models (RMs), including a Bradley–Terry model and a generative RM. Contribution/Results: Our trained RMs achieve 78% agreement with human preferences, surpassing Claude-3.7-Sonnet's zero-shot performance (73%). An online human preference study further confirms that their rankings align significantly with human judgments (p < 0.01). This work shifts creative-writing evaluation from heuristic, zero-shot judgment toward a learnable, verifiable, and reproducible framework, establishing a rigorous foundation for optimizing LLM-generated literature.
📝 Abstract
Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley-Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly LLM-generated stories. Our benchmark identifies Claude-3.7-Sonnet as the strongest off-the-shelf judge, reaching 73% agreement with human preferences; among trained reward models, Bradley-Terry and generative reward models both attain an accuracy of 78%, outperforming all off-the-shelf judges. An online human study further confirms that our trained reward models consistently align with human preferences on novel LLM-generated stories. We release LitBench and reward models at https://huggingface.co/collections/SAA-Lab/litbench-68267b5da3aafe58f9e43461, providing a vetted resource for reliable, automated evaluation and optimization of creative writing systems.
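To make the Bradley-Terry formulation concrete: a reward model assigns each story a scalar score, and the model posits that the probability a human prefers story A over story B is the sigmoid of their score difference; training minimizes the negative log-likelihood of the observed preference labels. The sketch below illustrates this objective with plain Python; the function names and scalar-score interface are illustrative assumptions, not the paper's actual training code.

```python
import math

def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the higher-scored ('chosen') story
    is preferred: sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of a single human preference label.
    Minimized when the reward margin r_chosen - r_rejected is large."""
    return -math.log(bt_preference_prob(r_chosen, r_rejected))

# Equal scores carry no preference information: probability is 0.5.
print(bt_preference_prob(0.0, 0.0))  # → 0.5
# A larger margin yields a smaller loss, pushing the model to
# separate preferred from rejected stories.
print(bt_loss(2.0, 0.0) < bt_loss(0.5, 0.0))  # → True
```

In practice the scalar scores would come from a language-model head evaluated on each story, and the loss would be averaged over the 43,827 training pairs; the pairwise form means no absolute "quality" ground truth is ever needed, only relative human judgments.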