Prompt Variability Effects On LLM Code Generation

๐Ÿ“… 2025-06-11
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work investigates the sensitivity of large language models (LLMs) to prompt variations in code generation tasks and how this sensitivity is modulated by user backgroundโ€”such as software development experience. To address this, we propose the first synthetic evaluation pipeline and a persona-based systematic assessment framework specifically designed for code generation; the framework is task- and model-agnostic. Our methodology comprises synthetic prompt construction, persona modeling grounded in realistic developer profiles, multi-dimensional functional and quality evaluation (e.g., correctness, readability, efficiency), and a cross-model reproducible evaluation protocol. Experimental results demonstrate that prompts conditioned on distinct user personas significantly affect both functional correctness and holistic code quality. Moreover, our framework consistently uncovers latent behavioral biases across diverse LLMs. All evaluation code and artifacts are publicly released to support reproducibility and community advancement.

Technology Category

Application Category

๐Ÿ“ Abstract
Code generation is one of the most active areas of application of Large Language Models (LLMs). While LLMs lower barriers to writing code and accelerate development process, the overall quality of generated programs depends on the quality of given prompts. Specifically, functionality and quality of generated code can be sensitive to user's background and familiarity with software development. It is therefore important to quantify LLM's sensitivity to variations in the input. To this end we propose a synthetic evaluation pipeline for code generation with LLMs, as well as a systematic persona-based evaluation approach to expose qualitative differences of LLM responses dependent on prospective user background. Both proposed methods are completely independent from specific programming tasks and LLMs, and thus are widely applicable. We provide experimental evidence illustrating utility of our methods and share our code for the benefit of the community.
Problem

Research questions and friction points this paper is trying to address.

Quantify LLM sensitivity to prompt variations in code generation
Evaluate code quality differences based on user background
Propose task-agnostic methods for assessing LLM code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthetic evaluation pipeline for LLM code generation
Persona-based evaluation for user background impact
Task and LLM independent widely applicable methods
๐Ÿ”Ž Similar Papers
No similar papers found.
Andrei Paleyes
Andrei Paleyes
PhD Candidate, University of Cambridge
Machine learningstatistical emulationsoftware
R
Radzim Sendyka
Department of Computer Science and Technology, University of Cambridge
D
Diana Robinson
Department of Computer Science and Technology, University of Cambridge
C
Christian Cabrera
Department of Computer Science and Technology, University of Cambridge
Neil D. Lawrence
Neil D. Lawrence
University of Cambridge
machine learningGaussian processes