🤖 AI Summary
Automated evaluation of creative texts (e.g., stories) suffers from inter-annotator subjectivity, leading to unreliable scores; moreover, self-consistency (SC) methods—optimized for explanation fluency—exhibit objective misalignment, compromising scoring accuracy. Method: We propose a two-stage reasoning paradigm: first, generating customizable “Chain-of-Keywords” to explicitly guide fine-grained evaluation dimensions; second, conditioning rationale generation and scoring on this keyword chain. Contribution/Results: This work is the first to identify and rectify the objective misalignment of Chain-of-Thought (CoT) and SC in subjective assessment tasks. By integrating keyword-diversity sampling, multi-path score aggregation, and chain-constrained generation, our method achieves human-level performance on the StoryER benchmark—attaining a correlation with human judgments twice that of GPT-4 while reducing parameter count by over an order of magnitude.
📝 Abstract
Evaluating creative text such as human-written stories using language models has always been a challenging task -- owing to the subjectivity of multi-annotator ratings. To mimic the human thinking process, chain of thought (CoT) generates free-text explanations that help guide a model's predictions, and Self-Consistency (SC) marginalizes predictions over multiple generated explanations. In this study, we discover that the widely-used self-consistency reasoning methods produce suboptimal results due to an objective mismatch between generating 'fluent-looking' explanations and actually leading to a good rating prediction for an aspect of a story. To overcome this challenge, we propose $\textbf{C}$hain-$\textbf{o}$f-$\textbf{Ke}$ywords (CoKe), which generates a sequence of keywords $\textit{before}$ generating a free-text rationale, guiding the rating prediction of our evaluation language model. We then generate a diverse set of such keywords and aggregate the scores corresponding to these generations. On the StoryER dataset, CoKe based on our small fine-tuned evaluation models not only reaches human-level performance and significantly outperforms GPT-4 with a 2x boost in correlation with human annotators, but also requires drastically fewer parameters.
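The two-stage procedure the abstract describes -- sample a keyword chain, condition the rationale and rating on it, repeat over diverse chains, and aggregate -- can be sketched as follows. This is a minimal illustration, not the authors' implementation: `sample_keyword_chain` and `score_with_rationale` are hypothetical stubs standing in for calls to the fine-tuned evaluation language model.

```python
import random
from statistics import mean

def sample_keyword_chain(story, aspect, rng):
    # Stage 1 (stub): the evaluation LM would emit a short sequence of
    # keywords that pin down fine-grained criteria for this aspect.
    vocab = ["pacing", "imagery", "tension", "dialogue", "arc", "voice"]
    return rng.sample(vocab, 3)

def score_with_rationale(story, aspect, keywords, rng):
    # Stage 2 (stub): rationale generation and the rating are both
    # conditioned on the sampled keyword chain.
    rationale = f"Considering {', '.join(keywords)} for {aspect}."
    score = rng.randint(1, 5)  # placeholder for the model's rating
    return rationale, score

def coke_score(story, aspect, n_paths=8, seed=0):
    # Keyword-diversity sampling + multi-path score aggregation:
    # draw several keyword chains and average the resulting ratings.
    rng = random.Random(seed)
    scores = []
    for _ in range(n_paths):
        keywords = sample_keyword_chain(story, aspect, rng)
        _, score = score_with_rationale(story, aspect, keywords, rng)
        scores.append(score)
    return mean(scores)

print(coke_score("Once upon a time...", "plot"))
```

In this sketch the aggregation is a plain mean over rating paths; the actual aggregation and sampling strategy in the paper may differ.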