Deep Associations, High Creativity: A Simple yet Effective Metric for Evaluating Large Language Models

📅 2025-10-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluation methods for large language model (LLM) creativity suffer from data contamination, high annotation costs, and a lack of psychometric grounding. Method: We propose PACE (Parallel Association Chains to Evaluate creativity), the first framework to adapt the parallel associative chaining paradigm from human creativity assessment to LLMs. PACE automatically elicits and scores associative chains, and is validated through Spearman correlation with external rankings, linguistic concreteness analysis, and controlled human comparison studies, enabling scalable, contamination-resistant evaluation across open- and closed-weight models. Contribution/Results: PACE agrees strongly with Chatbot Arena's Creative Writing rankings (Spearman's ρ = 0.739, p < 0.001) and reveals that current LLMs exhibit lower associative diversity than professional humans, though top-tier models approach the performance of average human participants. The framework establishes a reproducible, interpretable, cross-model benchmark for LLM creativity assessment, grounded in cognitive science principles and empirically validated across diverse architectures.

📝 Abstract
The evaluation of LLMs' creativity represents a crucial research domain, though challenges such as data contamination and costly human assessments often impede progress. Drawing inspiration from human creativity assessment, we propose PACE, asking LLMs to generate Parallel Association Chains to Evaluate their creativity. PACE minimizes the risk of data contamination and offers a straightforward, highly efficient evaluation, as evidenced by its strong correlation with Chatbot Arena Creative Writing rankings (Spearman's $ρ = 0.739$, $p < 0.001$) across various proprietary and open-source models. A comparative analysis of associative creativity between LLMs and humans reveals that while high-performing LLMs achieve scores comparable to average human performance, professional humans consistently outperform LLMs. Furthermore, linguistic analysis reveals that both humans and LLMs exhibit a trend of decreasing concreteness in their associations, with humans demonstrating a greater diversity of associative patterns.
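The headline agreement number is a Spearman rank correlation between two model orderings: PACE scores on one side, Arena creative-writing scores on the other. A minimal sketch of that comparison, using made-up PACE and Arena scores for five hypothetical models (the paper's actual scores are not reproduced here):

```python
def spearman_rho(xs, ys):
    """Spearman's rho for equal-length lists without ties:
    rank-transform both lists, then take the Pearson correlation of the ranks."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n + 1) / 2  # mean rank is the same for both lists (no ties)
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)  # rank variance, identical for rx and ry
    return cov / var

# Illustrative (not real) PACE scores and Arena ratings for five models,
# listed in the same model order:
pace = [0.81, 0.74, 0.62, 0.55, 0.40]
arena = [1290, 1255, 1198, 1210, 1105]
print(spearman_rho(pace, arena))  # → 0.9, a strong monotonic agreement
```

One rank inversion (the third and fourth models swap places between the two lists) pulls the coefficient below 1.0; a value like the paper's 0.739 reflects a few more such disagreements across a larger model pool.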
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM creativity while minimizing data contamination risks
Providing an efficient, automated alternative to costly human evaluation
Comparing associative creativity patterns between humans and language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes PACE metric for LLM creativity evaluation
Uses parallel association chains to assess creativity
Minimizes data contamination with efficient automated testing
Ziliang Qiu
University of Illinois Urbana-Champaign; Beijing Normal University
Renfen Hu
Beijing Normal University, Computational Linguistics