🤖 AI Summary
Current evaluations of large language model (LLM) creativity lack a unified framework, resulting in inconsistent definitions and metrics that hinder comparison across tasks and domains. To address this, we propose CreativityPrism, a decoupled evaluation framework that assesses three dimensions of creativity (quality, novelty, and diversity) across divergent thinking, creative writing, and logical reasoning, integrating nine representative tasks and twenty task-specific metrics. Empirical analysis of 17 state-of-the-art LLMs shows that quality and diversity metrics are strongly correlated while novelty remains largely independent of both; that performance correlates strongly within a domain but generalizes poorly across domains; and that proprietary models consistently outperform open-source counterparts. This work establishes a systematic, standardized, and comparable paradigm for LLM creativity assessment, enabling rigorous, multi-faceted analysis of generative capabilities.
📝 Abstract
Creativity is often seen as a hallmark of human intelligence. While large language models (LLMs) are increasingly perceived as producing creative text, there is still no holistic framework to evaluate their creativity across diverse scenarios. Existing evaluation methods remain fragmented, with dramatic variation across domains and tasks, largely due to differing definitions and measurements of creativity. Inspired by the hypothesis that creativity is not one fixed idea, we propose CreativityPrism, an evaluation analysis framework that decomposes creativity into three dimensions: quality, novelty, and diversity. CreativityPrism incorporates nine tasks spanning three domains (divergent thinking, creative writing, and logical reasoning) and twenty evaluation metrics, each of which measures a dimension in a task-specific way. We evaluate 17 state-of-the-art (SoTA) proprietary and open-source LLMs on CreativityPrism and analyze the performance correlations among different metrics and task domains. Our results reveal a notable gap between proprietary and open-source models. Overall, model performance tends to be highly correlated across tasks within the same domain and less so across different domains. Among evaluation dimensions, diversity and quality metrics show strong correlations: models that perform well on one often excel on the other, whereas novelty exhibits much weaker correlation with either. These findings support our hypothesis that strong performance in one creativity task or dimension does not necessarily generalize to others, underscoring the need for a holistic evaluation of LLM creativity.
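As a rough illustration of the correlation analysis described above, one can compare how models rank on two dimensions using a rank correlation. The sketch below is not the authors' code; the model names, scores, and the choice of Spearman's rho are illustrative assumptions.

```python
# Minimal sketch: do model rankings on two creativity dimensions agree?
# Scores and model names are hypothetical placeholders, not paper results.
from scipy.stats import spearmanr

# Hypothetical per-model aggregate scores on two dimensions (e.g., quality vs. novelty).
quality = {"model_a": 0.82, "model_b": 0.74, "model_c": 0.61, "model_d": 0.55}
novelty = {"model_a": 0.40, "model_b": 0.58, "model_c": 0.47, "model_d": 0.52}

models = sorted(quality)  # align the two score vectors by model name
rho, p_value = spearmanr([quality[m] for m in models],
                         [novelty[m] for m in models])
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")
```

A rho near 1 would indicate that strong performance on one dimension predicts strong performance on the other, while a value near 0 suggests the dimensions capture largely independent capabilities, as the paper reports for novelty.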