CoKe: Customizable Fine-Grained Story Evaluation via Chain-of-Keyword Rationalization

📅 2025-03-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Automated evaluation of creative texts (e.g., stories) suffers from inter-annotator subjectivity, leading to unreliable scores; moreover, self-consistency (SC) methods—optimized for explanation fluency—exhibit objective misalignment, compromising scoring accuracy. Method: We propose a two-stage reasoning paradigm: first, generating customizable “Chain-of-Keywords” to explicitly guide fine-grained evaluation dimensions; second, conditioning rationale generation and scoring on this keyword chain. Contribution/Results: This work is the first to identify and rectify the objective misalignment of Chain-of-Thought (CoT) and SC in subjective assessment tasks. By integrating keyword-diversity sampling, multi-path score aggregation, and chain-constrained generation, our method achieves human-level performance on the StoryER benchmark—attaining a correlation with human judgments twice that of GPT-4 while reducing parameter count by over an order of magnitude.
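The two-stage paradigm described above can be sketched as prompt construction: stage 1 elicits a keyword chain for one evaluation aspect, and stage 2 conditions the rationale and score on that chain. This is a minimal illustration; the function names and prompt wording are hypothetical, not taken from the paper's released code.

```python
# Hypothetical sketch of CoKe-style two-stage prompting. Names and prompt
# phrasing are illustrative assumptions, not the paper's actual templates.

def build_keyword_prompt(story: str, aspect: str) -> str:
    """Stage 1: elicit a chain of keywords for the given evaluation aspect."""
    return (
        f"Story:\n{story}\n\n"
        f"Aspect: {aspect}\n"
        "List the keywords that should guide the rating of this aspect:"
    )

def build_rationale_prompt(story: str, aspect: str, keywords: list[str]) -> str:
    """Stage 2: condition the free-text rationale and score on the keywords."""
    return (
        f"Story:\n{story}\n\n"
        f"Aspect: {aspect}\n"
        f"Guiding keywords: {', '.join(keywords)}\n"
        "Write a rationale grounded in these keywords, then output a 1-5 rating:"
    )

story = "A lighthouse keeper befriends a stranded whale."
p1 = build_keyword_prompt(story, "plot coherence")
p2 = build_rationale_prompt(story, "plot coherence", ["setup", "payoff", "causality"])
print(p1)
print(p2)
```

Constraining the rationale to a previously generated keyword chain is what aligns the explanation with the scoring objective, rather than optimizing for fluency alone.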

📝 Abstract
Evaluating creative text such as human-written stories using language models has always been a challenging task -- owing to the subjectivity of multi-annotator ratings. To mimic the thinking process of humans, chain of thought (CoT) generates free-text explanations that help guide a model's predictions, and Self-Consistency (SC) marginalizes predictions over multiple generated explanations. In this study, we discover that the widely-used self-consistency reasoning methods cause suboptimal results due to an objective mismatch between generating 'fluent-looking' explanations and actually leading to a good rating prediction for an aspect of a story. To overcome this challenge, we propose $\textbf{C}$hain-$\textbf{o}$f-$\textbf{Ke}$ywords (CoKe), which generates a sequence of keywords $\textit{before}$ generating a free-text rationale that guides the rating prediction of our evaluation language model. Then, we generate a diverse set of such keywords and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe based on our small fine-tuned evaluation models not only reaches human-level performance and significantly outperforms GPT-4 with a 2x boost in correlation with human annotators, but also requires drastically fewer parameters.
Problem

Research questions and friction points this paper is trying to address.

Improving story evaluation accuracy using customizable keyword-guided rationales
Addressing subjectivity in multi-annotator ratings for creative text assessment
Reducing model size while achieving human-level evaluation performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates keywords before rationale for guidance
Diverse keyword aggregation boosts performance
Small models outperform GPT-4 efficiently
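The diverse-aggregation idea in the bullets above follows the self-consistency recipe: sample several keyword chains, score the story once per chain, and marginalize over the resulting ratings. A minimal sketch, assuming a simple mean as the aggregator (the paper's exact aggregation rule may differ):

```python
# Illustrative multi-path score aggregation. Each sampled keyword chain is
# assumed to yield one rating; averaging over chains marginalizes out the
# choice of keywords, as in self-consistency. The mean aggregator here is
# an assumption for illustration.
from statistics import mean

def aggregate_scores(path_scores: list[float]) -> float:
    """Average the ratings obtained from diverse keyword chains."""
    return mean(path_scores)

# Ratings for one aspect from three diverse keyword chains:
print(round(aggregate_scores([4.0, 3.0, 4.0]), 2))  # -> 3.67
```

Averaging over keyword-diverse paths smooths out the rating noise introduced by any single chain, which is where the robustness gain over single-explanation CoT comes from.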