🤖 AI Summary
Evaluating the creativity of large language models (LLMs) has long suffered from high annotation costs, poor generalizability of automated metrics, and low inter-annotator agreement. To address these challenges, this work introduces CreataSet, a large-scale, cross-domain textual creativity benchmark comprising over 100K human-level and 1M synthetically generated instruction-response pairs. We propose a context-aware pairwise comparison framework and develop CrEval, an automated evaluator trained for high alignment with human judgment. The approach integrates human-generated and synthetic data with instruction-shared pairwise learning, substantially improving evaluation consistency, cross-domain generalizability, and scalability. Experiments demonstrate that CrEval consistently outperforms existing automated methods across diverse domains and aligns significantly better with human preferences. Moreover, the results provide empirical evidence that combining synthetic and human data is critical for evaluation robustness, a finding that in turn enables iterative improvement of LLMs' generative creativity.
📝 Abstract
Creativity evaluation remains a challenging frontier for large language models (LLMs). Current evaluations rely heavily on inefficient and costly human judgments, hindering progress in enhancing machine creativity. While automated methods exist, ranging from psychological testing to heuristic- or prompting-based approaches, they often lack generalizability or alignment with human judgment. To address these issues, we propose a novel pairwise-comparison framework for assessing textual creativity that leverages shared contextual instructions to improve evaluation consistency. We introduce CreataSet, a large-scale dataset with 100K+ human-level and 1M+ synthetic creative instruction-response pairs spanning diverse open-domain tasks. By training on CreataSet, we develop an LLM-based evaluator named CrEval, which aligns with human judgments markedly better than existing methods. Experimental results show that integrating both human-generated and synthetic data is indispensable for training robust evaluators, and demonstrate the practical utility of CrEval in boosting the creativity of LLMs. We will publicly release all data, code, and models soon to support further research.
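To make the core idea concrete, below is a minimal sketch of a context-aware pairwise comparison: both candidate responses are judged under the same shared instruction, and the evaluator returns a preference. The prompt template and the `query_evaluator` hook are illustrative assumptions for this sketch, not the paper's released prompt format or CrEval's actual interface.

```python
# Sketch of context-aware pairwise creativity comparison.
# Both responses share one instruction (the context), so the judgment
# is consistent across candidates; the evaluator picks the more
# creative one. `query_evaluator` is a hypothetical stand-in for an
# LLM-based judge such as a trained CrEval model.

PAIRWISE_TEMPLATE = """You are a creativity evaluator.
Instruction (shared context): {instruction}

Response A: {response_a}
Response B: {response_b}

Which response is more creative given the instruction? Answer "A" or "B"."""


def compare_creativity(instruction: str, response_a: str,
                       response_b: str, query_evaluator) -> str:
    """Return 'A' or 'B' according to the evaluator's pairwise preference."""
    prompt = PAIRWISE_TEMPLATE.format(
        instruction=instruction,
        response_a=response_a,
        response_b=response_b,
    )
    verdict = query_evaluator(prompt).strip().upper()
    return "A" if verdict.startswith("A") else "B"


if __name__ == "__main__":
    # Stub evaluator for demonstration; a real setup would query an LLM.
    stub = lambda prompt: "B"
    winner = compare_creativity(
        instruction="Write a one-line metaphor for time.",
        response_a="Time is like a river.",
        response_b="Time is a pickpocket who leaves receipts.",
        query_evaluator=stub,
    )
    print(f"More creative response: {winner}")
```

Because every comparison is grounded in the same instruction, pairwise verdicts like this can be aggregated (e.g., into win rates) to rank systems, which is the general advantage of pairwise protocols over absolute scoring.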