EvolvR: Self-Evolving Pairwise Reasoning for Story Evaluation to Enhance Generation

📅 2025-08-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limitations of LLM-as-a-judge in open-ended story evaluation (insufficient reasoning capability, poor adaptability of closed-source models to prompt engineering, and inadequate reasoning support when fine-tuning open-source models), this paper proposes the Self-Evolving Pairwise Reasoning (EvolvR) framework. The method leverages multi-role self-synthesis and multi-agent self-filtering to generate and refine high-quality, score-aligned chain-of-thought (CoT) pairwise comparison data, enabling evaluation models to evolve autonomously. By integrating pairwise comparison, multi-agent collaboration, CoT-based reasoning, and supervised fine-tuning, the framework establishes an iteratively optimizable evaluation system. Evaluated on three major benchmarks (StoryER, HANNA, and OpenMEVA), the approach achieves state-of-the-art performance and, when deployed as a reward model, significantly improves downstream story generation quality.

📝 Abstract
Although the effectiveness of Large Language Models (LLMs) as judges (LLM-as-a-judge) has been validated, their performance remains limited in open-ended tasks, particularly in story evaluation. Accurate story evaluation is crucial not only for assisting human quality judgment but also for providing key signals to guide story generation. However, existing methods face a dilemma: prompt engineering for closed-source models suffers from poor adaptability, while fine-tuning approaches for open-source models lack the rigorous reasoning capabilities essential for story evaluation. To address this, we propose the Self-Evolving Pairwise Reasoning (EvolvR) framework. Grounded in pairwise comparison, the framework first self-synthesizes score-aligned Chain-of-Thought (CoT) data via a multi-persona strategy. To ensure data quality, these raw CoTs undergo a self-filtering process, utilizing multi-agents to guarantee their logical rigor and robustness. Finally, the evaluator trained on the refined data is deployed as a reward model to guide the story generation task. Experimental results demonstrate that our framework achieves state-of-the-art (SOTA) performance on three evaluation benchmarks including StoryER, HANNA and OpenMEVA. Furthermore, when served as a reward model, it significantly enhances the quality of generated stories, thereby fully validating the superiority of our self-evolving approach.
Problem

Research questions and friction points this paper is trying to address.

Improving story evaluation accuracy in open-ended tasks
Enhancing reasoning capabilities for LLM-based story assessment
Bridging adaptability and rigor gaps in evaluation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-synthesizes score-aligned Chain-of-Thought data
Employs multi-agent self-filtering for logical rigor
Deploys evaluator as reward model for generation
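The pipeline these points describe (persona-driven CoT synthesis, multi-agent filtering, iterative fine-tuning, reward-model deployment) can be sketched roughly as below. This is a minimal illustration assuming generic callable interfaces; all function and parameter names (`generate_cot`, `agent`, `finetune`, etc.) are hypothetical placeholders, not the paper's actual API.

```python
# Hedged sketch of an EvolvR-style self-evolving loop (illustrative only).
from dataclasses import dataclass

@dataclass
class PairwiseExample:
    story_a: str
    story_b: str
    preferred: str   # "A" or "B", aligned with human scores
    rationale: str   # chain-of-thought comparison text

def synthesize(pairs, personas, generate_cot):
    """Multi-role self-synthesis: each persona writes a CoT comparison."""
    data = []
    for a, b, label in pairs:
        for persona in personas:
            cot = generate_cot(persona, a, b)
            data.append(PairwiseExample(a, b, label, cot))
    return data

def self_filter(examples, agents):
    """Multi-agent self-filtering: keep CoTs every agent judges rigorous."""
    return [ex for ex in examples
            if all(agent(ex.rationale, ex.preferred) for agent in agents)]

def evolve(pairs, personas, generate_cot, agents, finetune, rounds=3):
    """Iterate synthesis -> filtering -> supervised fine-tuning."""
    evaluator = None
    for _ in range(rounds):
        raw = synthesize(pairs, personas, generate_cot)
        clean = self_filter(raw, agents)
        evaluator = finetune(evaluator, clean)
    return evaluator  # then deployable as a reward model for generation
```

The trained evaluator is what the paper then plugs in as a reward signal for story generation; everything upstream of `finetune` only shapes the training data.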
Authors
Xinda Wang (University of Texas at Dallas; Software Security, AI Security, Systems Security)
Zhengxu Hou (Alibaba Group)
Yangshijie Zhang (Alibaba Group, Lanzhou University)
Bingren Yan (Alibaba Group)
Zhibo Yang (Alibaba Group)
Xingsheng Zhang (Alibaba Group, University of the Chinese Academy of Sciences)
Luxi Xing (Institute of Information Engineering, Chinese Academy of Sciences; Natural Language Processing)
Qiang Zhou (Alibaba Group)
Chen Zhang (Alibaba Group)