🤖 AI Summary
This study addresses the need to quantify the sources of variation in large language models’ outputs on creative tasks, disentangling the contributions of prompts, model choice, and sampling randomness. By generating 100 samples per prompt across 12 models for 10 creative prompts (yielding 12,000 total outputs), the authors employ variance decomposition to systematically quantify the variance components of originality and fluency. Their analysis reveals that prompt selection accounts for 36.43% of the variance in originality, nearly matching the 40.94% attributable to model choice. In contrast, fluency is predominantly driven by model choice (51.25%) and within-model stochasticity (33.70%), with prompts contributing only 4.22%. These findings underscore the susceptibility of single-sample evaluations to sampling noise and highlight the necessity of multi-sample generation and controlled experimental designs when assessing creative language generation.
📝 Abstract
How much of LLM output variance is explained by prompts versus model choice versus sampling stochasticity? We answer this by evaluating 12 LLMs on 10 creativity prompts with 100 samples each (N = 12,000). For output quality (originality), prompts explain 36.43% of variance, comparable to model choice (40.94%). But for output quantity (fluency), model choice (51.25%) and within-LLM variance (33.70%) dominate, with prompts explaining only 4.22%. Prompts are thus powerful levers for steering output quality, but given the substantial within-LLM variance (10-34%), single-sample evaluations risk conflating sampling noise with genuine prompt or model effects.
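To make the analysis concrete, here is a minimal sketch of the kind of sum-of-squares variance decomposition the abstract describes, on simulated data with the paper's 10 × 12 × 100 design. The effect sizes, variable names, and additive-noise model are illustrative assumptions, not the authors' actual pipeline; the residual term here lumps together the prompt-model interaction and within-cell sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)
P, M, S = 10, 12, 100  # prompts, models, samples per cell (as in the paper)

# Simulated scores: additive prompt and model effects plus sampling noise
# (illustrative effect scales, not estimates from the paper).
prompt_eff = rng.normal(0.0, 1.0, size=(P, 1, 1))
model_eff = rng.normal(0.0, 1.0, size=(1, M, 1))
noise = rng.normal(0.0, 0.7, size=(P, M, S))
y = prompt_eff + model_eff + noise  # shape (P, M, S)

def variance_shares(y):
    """Balanced two-way sum-of-squares decomposition.

    Returns the fractions of total variance attributable to prompts,
    to models, and to the residual (interaction + sampling noise).
    """
    grand = y.mean()
    ss_total = ((y - grand) ** 2).sum()
    ss_prompt = M * S * ((y.mean(axis=(1, 2)) - grand) ** 2).sum()
    ss_model = P * S * ((y.mean(axis=(0, 2)) - grand) ** 2).sum()
    ss_resid = ss_total - ss_prompt - ss_model
    return ss_prompt / ss_total, ss_model / ss_total, ss_resid / ss_total

print([round(s, 3) for s in variance_shares(y)])
```

In a balanced design like this one the three shares sum to 1, so each fraction can be read directly as "percent of variance explained," mirroring the 36.43% / 40.94% style of figures reported above.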