🤖 AI Summary
This study addresses the challenge of automatic quality assessment for long novels (>100K tokens). To this end, we introduce LongStoryEval—the first large-scale benchmark for long-story evaluation—comprising 600 newly published novels (average length: 121K tokens), with aspect-level annotations derived from real user reviews to identify core dimensions such as plot, character development, and thematic depth. We propose a multi-strategy evaluation framework that systematically compares aggregation-based, incremental, and summarization-based approaches, finding that the aggregation-based and summarization-based approaches perform best: the former excels at detail assessment, while the latter is more efficient. Furthermore, we establish the first structured evaluation criteria specifically designed for long narratives and leverage them to fine-tune a lightweight 8B-parameter model, NovelCritique, built on the efficient summarization-based framework. Experiments demonstrate that NovelCritique aligns more closely with human judgments across multiple dimensions than commercial models such as GPT-4o. The LongStoryEval benchmark, source code, and model are publicly released.
📝 Abstract
In this work, we conduct systematic research in a challenging area: the automatic evaluation of book-length stories (>100K tokens). Our study focuses on two key questions: (1) which evaluation aspects matter most to readers, and (2) which methods are effective for evaluating lengthy stories. We introduce the first large-scale benchmark, LongStoryEval, comprising 600 newly published books with an average length of 121K tokens (maximum 397K). Each book includes its average rating and multiple reader reviews, presented as critiques organized by evaluation aspects. By analyzing all user-mentioned aspects, we propose an evaluation criteria structure and conduct experiments to identify the most significant aspects among the 8 top-level criteria. For evaluation methods, we compare the effectiveness of three types: aggregation-based, incremental-update, and summary-based evaluation. Our findings reveal that aggregation- and summary-based evaluations perform better, with the former excelling in detail assessment and the latter offering greater efficiency. Building on these insights, we further propose NovelCritique, an 8B model that leverages the efficient summary-based framework to review and score stories across specified aspects. NovelCritique outperforms commercial models like GPT-4o in aligning with human evaluations. Our datasets and code are available at https://github.com/DingyiYang/LongStoryEval.
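The summary-based evaluation described above can be sketched as a simple pipeline: split the book-length story into chunks, summarize each chunk, then score the condensed summary against the specified aspects. The sketch below is an illustration only, not the paper's actual implementation; `summarize` and `score_aspects` are hypothetical stand-ins for LLM calls, stubbed here so the flow is runnable.

```python
def chunk_text(text: str, chunk_size: int = 2000) -> list[str]:
    """Split a long story into fixed-size character chunks."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def summarize(chunk: str) -> str:
    """Stub for an LLM summarization call: keep the chunk's first sentence."""
    return chunk.split(".")[0].strip() + "."

def score_aspects(summary: str, aspects: list[str]) -> dict[str, float]:
    """Stub for an LLM scoring call: placeholder score per requested aspect."""
    return {aspect: 3.0 for aspect in aspects}

def summary_based_eval(story: str, aspects: list[str]) -> dict[str, float]:
    # 1) Compress the book-length story chunk by chunk.
    summaries = [summarize(chunk) for chunk in chunk_text(story)]
    # 2) Score the concatenated summary once, instead of scoring every chunk
    #    as in aggregation-based evaluation, trading detail for efficiency.
    condensed = " ".join(summaries)
    return score_aspects(condensed, aspects)

story = "A long opening chapter unfolds. " * 500
scores = summary_based_eval(story, ["plot", "character development"])
```

The contrast with aggregation-based evaluation is in step 2: that variant would call `score_aspects` on every chunk and aggregate the per-chunk scores, preserving detail at higher cost.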