🤖 AI Summary
Prior work lacks a systematic understanding of how input parameters—such as prompt design, temperature, number of candidate solutions, and context—affect code generation in language models, hindering their reliable deployment.
Method: This paper conducts the first controlled experiments on GitHub Copilot and OpenAI Codex, establishing a reproducible parameter perturbation framework grounded in HumanEval and LeetCode benchmarks.
Contribution/Results: We empirically uncover strong, nonlinear couplings among temperature, prompt formulation, and candidate count—demonstrating that optimizing any single parameter in isolation is ineffective and that joint parameter tuning is essential. This challenges conventional manual hyperparameter tuning and provides both theoretical grounding and empirical evidence for automated parameter optimization. Experimental results show substantial correctness improvements under coordinated tuning; however, optimal configurations exhibit high sensitivity to parameter changes, underscoring the necessity of systematic, holistic parameter control.
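The coupling described above implies that one-at-a-time tuning can miss the best configuration. The following sketch illustrates a joint sweep over temperature, prompt formulation, and candidate count; the `evaluate` function is a hypothetical stand-in (the study queried Copilot/Codex against real benchmarks), with a toy score chosen only to mimic an interaction between temperature and candidate count:

```python
import itertools

# Hypothetical stand-in for scoring one model configuration; the real study
# measured correctness of Copilot/Codex outputs on HumanEval/LeetCode.
# The toy score couples temperature and candidate count: higher temperature
# helps only when many candidates are sampled.
def evaluate(temperature, prompt_style, n_candidates):
    base = {"concise": 0.4, "detailed": 0.5}[prompt_style]
    return base + temperature * (n_candidates - 1) * 0.01

def joint_search(temperatures, prompt_styles, candidate_counts):
    """Exhaustively sweep all parameter combinations jointly,
    rather than optimizing each parameter in isolation."""
    return max(
        itertools.product(temperatures, prompt_styles, candidate_counts),
        key=lambda cfg: evaluate(*cfg),
    )

best_cfg = joint_search([0.0, 0.5, 1.0], ["concise", "detailed"], [1, 10])
```

With this toy score, a greedy single-parameter search started at `n_candidates=1` would see no benefit from raising the temperature; only the joint sweep finds the high-temperature, many-candidates configuration.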
📝 Abstract
Language models are promising solutions for tackling increasingly complex problems. In software engineering, they have recently attracted attention as the backbone of code assistants, which automatically write programs in a given programming language from a natural-language description of a programming task. They have the potential to save time and effort when writing code. However, these systems are currently poorly understood, preventing them from being used optimally. In this paper, we investigate the various input parameters of two language models, and conduct a study to understand whether variations of these input parameters (e.g., the programming task description and its surrounding context, the creativity of the language model, the number of generated solutions) have a significant impact on the quality of the generated programs. We design specific operators for varying input parameters and apply them to two code assistants (Copilot and Codex) and two benchmarks of algorithmic problems (HumanEval and LeetCode). Our results show that varying the input parameters can significantly improve the performance of language models. However, there is a tight dependency among the temperature, the prompt, and the number of generated solutions, making it potentially hard for developers to properly control these parameters and obtain an optimal result. This work opens opportunities to propose (automated) strategies for improving performance.
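The abstract does not name the exact correctness metric, but for HumanEval-style evaluation with a variable number of generated solutions, the standard score is the unbiased pass@k estimator; the sketch below assumes that metric is applicable here:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn without replacement from n generated candidates of which
    c pass the tests, is correct. Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect candidates: every size-k sample
        # necessarily contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 candidates of which 2 are correct, a single draw succeeds half the time, so `pass_at_k(4, 2, 1)` is 0.5; increasing the number of generated solutions raises pass@k, which is why candidate count interacts with temperature in the study's results.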