🤖 AI Summary
Existing evaluation frameworks lack rigorous, cognitively grounded metrics for quantifying code-generation creativity in large language models (LLMs), particularly regarding convergent (goal-directed) and divergent (constraint-adaptive) thinking.
Method: We propose DENIAL PROMPTING—a constraint-iterative prompting technique—and NEOGAUGE, a dual-dimensional metric measuring both convergent and divergent creativity. We also introduce NEOCODER, a benchmark dataset comprising real Codeforces programming problems and human-written solutions.
Contribution/Results: Through systematic evaluation of diverse open- and closed-source LLMs, as well as advanced reasoning strategies (e.g., MCTS, self-correction), we validate NEOGAUGE’s reliability and cognitive interpretability. Results show that state-of-the-art LLMs—including GPT-4—exhibit substantially lower creative performance than humans, and that current reasoning techniques yield only marginal improvements. Our framework establishes the first reproducible, cognition-aware paradigm for assessing LLM code-generation creativity.
📝 Abstract
As LLMs become increasingly prevalent, it is interesting to consider how "creative" these models can be. From cognitive science, creativity consists of at least two key characteristics: *convergent* thinking (purposefulness to achieve a given goal) and *divergent* thinking (adaptability to explore new environments or constraints) (Runco, 2003). In this work, we introduce a framework for quantifying LLM creativity that incorporates two design ingredients: (1) We introduce DENIAL PROMPTING, which pushes LLMs to develop more creative solutions to a given problem by incrementally imposing new constraints on the previous solution, compelling LLMs to adopt new strategies. (2) We define NEOGAUGE, a metric that quantifies both convergent and divergent thinking in the creative responses generated by LLMs. We test the proposed framework on Codeforces problems, which serve as both a natural dataset for coding tasks and a collection of prior human solutions. We quantify NEOGAUGE for various proprietary and open-source models and find that even the most creative model, GPT-4, still falls short of demonstrating human-like creativity. We also experiment with advanced reasoning strategies (MCTS, self-correction, etc.) and observe no significant improvement in creativity. As a by-product of our analysis, we release the NEOCODER dataset for reproducing our results on future models.