🤖 AI Summary
Automated evaluation of creative texts (e.g., stories) suffers from inter-annotator subjectivity, leading to unreliable scores; moreover, self-consistency (SC) methods—optimized for explanation fluency—exhibit objective misalignment, compromising scoring accuracy. Method: We propose a two-stage reasoning paradigm: first, generating customizable “Chain-of-Keywords” to explicitly guide fine-grained evaluation dimensions; second, conditioning rationale generation and scoring on this keyword chain. Contribution/Results: This work is the first to identify and rectify the objective misalignment of Chain-of-Thought (CoT) and SC in subjective assessment tasks. By integrating keyword-diversity sampling, multi-path score aggregation, and chain-constrained generation, our method achieves human-level performance on the StoryER benchmark—attaining a correlation with human judgments twice that of GPT-4 while reducing parameter count by over an order of magnitude.
📝 Abstract
Evaluating creative text such as human-written stories using language models has always been a challenging task -- owing to the subjectivity of multi-annotator ratings. To mimic the human thinking process, chain of thought (CoT) generates free-text explanations that help guide a model's predictions, and Self-Consistency (SC) marginalizes predictions over multiple generated explanations. In this study, we discover that the widely-used self-consistency reasoning methods produce suboptimal results due to an objective mismatch between generating 'fluent-looking' explanations and actually leading to a good rating prediction for an aspect of a story. To overcome this challenge, we propose $\textbf{C}$hain-$\textbf{o}$f-$\textbf{Ke}$ywords (CoKe), which generates a sequence of keywords $\textit{before}$ generating a free-text rationale, guiding the rating prediction of our evaluation language model. We then generate a diverse set of such keywords and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe based on our small fine-tuned evaluation models not only reaches human-level performance and significantly outperforms GPT-4 with a 2x boost in correlation with human annotators, but also requires drastically fewer parameters.
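The two-stage procedure the abstract describes -- sample a keyword chain, condition the rationale and rating on it, repeat over diverse chains, and aggregate -- can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_keyword_chain` and `score_with_rationale` are hypothetical stubs standing in for calls to the fine-tuned evaluation language model.

```python
import random
from statistics import mean

def sample_keyword_chain(story, aspect, rng):
    # Stage 1 (stub): the evaluation LM would emit a short sequence of
    # keywords that pin down fine-grained criteria for this aspect.
    vocab = ["pacing", "imagery", "tension", "dialogue", "arc", "voice"]
    return rng.sample(vocab, 3)

def score_with_rationale(story, aspect, keywords, rng):
    # Stage 2 (stub): rationale generation and the rating are both
    # conditioned on the sampled keyword chain.
    rationale = f"Considering {', '.join(keywords)} for {aspect}."
    score = rng.randint(1, 5)  # placeholder for the model's rating
    return rationale, score

def coke_score(story, aspect, n_paths=8, seed=0):
    # Keyword-diversity sampling + multi-path score aggregation:
    # draw several keyword chains and average the resulting ratings.
    rng = random.Random(seed)
    scores = []
    for _ in range(n_paths):
        keywords = sample_keyword_chain(story, aspect, rng)
        _, score = score_with_rationale(story, aspect, keywords, rng)
        scores.append(score)
    return mean(scores)

print(coke_score("Once upon a time...", "plot"))
```

In this sketch the aggregation is a plain mean over rating paths; the actual aggregation and sampling strategy in the paper may differ.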