🤖 AI Summary
Creative writing lacks reliable, automated evaluation standards. Method: We introduce LitBench, the first standardized benchmark for literary generation, featuring a large-scale, human-annotated dataset of pairwise story comparisons. Rather than relying on zero-shot LLM judges, we construct debiased paired data and train dedicated reward models (RMs), including a Bradley–Terry model and a generative RM. Contribution/Results: Our trained RMs achieve 78% agreement with human preferences, surpassing Claude-3.7-Sonnet's zero-shot performance (73%). An online human preference study further confirms that their rankings align significantly with human judgments (p < 0.01). This work shifts creative-writing evaluation from heuristic, zero-shot judgment toward a learnable, verifiable, and reproducible framework, establishing a rigorous foundation for optimizing LLM-generated literature.
📝 Abstract
Evaluating creative writing generated by large language models (LLMs) remains challenging because open-ended narratives lack ground truths. Without performant automated evaluation methods, off-the-shelf (OTS) language models are employed as zero-shot judges, yet their reliability is unclear in this context. In pursuit of robust evaluation for creative writing, we introduce LitBench, the first standardized benchmark and paired dataset for creative writing verification, comprising a held-out test set of 2,480 debiased, human-labeled story comparisons drawn from Reddit and a 43,827-pair training corpus of human preference labels. Using LitBench, we (i) benchmark zero-shot LLM judges, (ii) train Bradley-Terry and generative reward models, and (iii) conduct an online human study to validate reward model rankings on newly LLM-generated stories. Our benchmark identifies Claude-3.7-Sonnet as the strongest off-the-shelf judge, reaching 73% agreement with human preferences; among trained reward models, Bradley-Terry and generative reward models both attain an accuracy of 78%, outperforming all off-the-shelf judges. An online human study further confirms that our trained reward models consistently align with human preferences on novel LLM-generated stories. We release LitBench and reward models at https://huggingface.co/collections/SAA-Lab/litbench-68267b5da3aafe58f9e43461, providing a vetted resource for reliable, automated evaluation and optimization of creative writing systems.
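To make the Bradley-Terry formulation concrete: a reward model assigns each story a scalar score, and the model posits that the probability a human prefers story A over story B is the sigmoid of their score difference; training minimizes the negative log-likelihood of the observed preference labels. The sketch below illustrates this objective with plain Python; the function names and scalar-score interface are illustrative assumptions, not the paper's actual training code.

```python
import math

def bt_preference_prob(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry probability that the higher-scored ('chosen') story
    is preferred: sigmoid of the reward difference."""
    return 1.0 / (1.0 + math.exp(-(r_chosen - r_rejected)))

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of a single human preference label.
    Minimized when the reward margin r_chosen - r_rejected is large."""
    return -math.log(bt_preference_prob(r_chosen, r_rejected))

# Equal scores carry no preference information: probability is 0.5.
print(bt_preference_prob(0.0, 0.0))  # → 0.5
# A larger margin yields a smaller loss, pushing the model to
# separate preferred from rejected stories.
print(bt_loss(2.0, 0.0) < bt_loss(0.5, 0.0))  # → True
```

In practice the scalar scores would come from a language-model head evaluated on each story, and the loss would be averaged over the 43,827 training pairs; the pairwise form means no absolute "quality" ground truth is ever needed, only relative human judgments.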