Prompt Alchemy: Automatic Prompt Refinement for Enhancing Code Generation

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) for code generation depend strongly on prompt quality, while manual prompt engineering is inefficient and inconsistent. Method: This paper introduces Prochemy, an adaptive prompt-refinement framework designed for code generation. It integrates multi-agent collaborative optimization, model-feedback-driven iterative prompt search, task-performance-guided automated evaluation and selection, and cross-model generalization, ensuring inference consistency and enabling plug-and-play deployment. Results: On HumanEval, zero-shot performance improves by 5.0% for GPT-3.5-Turbo and 1.9% for GPT-4o; Java↔Python cross-language translation accuracy increases by up to 17.1%; and integration with state-of-the-art methods (e.g., LDB) yields consistent gains of 1.2-1.8%. The framework thus advances automated, robust, and generalizable prompt optimization for code-generation LLMs.

📝 Abstract
Code generation has emerged as a key task to automate software development by converting high-level descriptions into executable code. Large language models (LLMs) excel at this but depend heavily on input prompt quality. Manual prompt engineering can be time-consuming and inconsistent, limiting LLM effectiveness. This paper introduces Prochemy, an innovative method for automatically refining prompts to boost code generation. Prochemy overcomes manual prompt limitations by automating optimization, ensuring consistency during inference, and supporting multi-agent systems. It iteratively refines prompts based on model performance, using an optimized final prompt for improved consistency across tasks. We tested Prochemy on natural-language-based code generation and translation tasks using three LLM series. Results indicate Prochemy enhances existing methods, improving performance by 5.0% for GPT-3.5-Turbo and 1.9% for GPT-4o over zero-shot baselines on HumanEval. Combined with the state-of-the-art LDB, Prochemy surpasses the standalone method by 1.2-1.8%. For code translation, Prochemy boosts GPT-4o's Java-to-Python (AVATAR) performance from 74.5 to 84.1 (+12.9%) and Python-to-Java from 66.8 to 78.2 (+17.1%). Moreover, Prochemy maintains strong performance when integrated with the o1-mini model, validating its efficacy in code tasks. Designed as plug-and-play, Prochemy optimizes prompts with minimal human input, bridging the gap between simple prompts and complex frameworks.
Problem

Research questions and friction points this paper is trying to address.

LLM code generation depends heavily on the quality of the input prompt.
Manual prompt engineering is time-consuming and produces inconsistent results.
Inconsistent prompting limits LLM performance on code generation and translation tasks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically refines prompts for code generation (Prochemy)
Iteratively optimizes prompts based on model performance
Plug-and-play design enhances LLM performance with minimal human input
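
The iterative, performance-guided refinement loop described above can be sketched roughly as follows. This is a minimal illustration with a mock model; `toy_llm`, `mutate`, and the toy task set are illustrative stand-ins, not Prochemy's actual algorithm:

```python
# Hypothetical sketch of a performance-guided iterative prompt search:
# mutate the current prompt, score each candidate by how many tasks the
# model solves, and greedily keep the best-scoring prompt.

def pass_rate(llm, prompt, tasks):
    """Fraction of tasks whose generated output passes its test."""
    return sum(task["test"](llm(prompt, task["spec"])) for task in tasks) / len(tasks)

def refine_prompt(llm, seed_prompt, mutate, tasks, iterations=5):
    """Greedily keep the best-scoring prompt across mutation rounds."""
    best_prompt = seed_prompt
    best_score = pass_rate(llm, seed_prompt, tasks)
    for _ in range(iterations):
        candidate = mutate(best_prompt)
        score = pass_rate(llm, candidate, tasks)
        if score > best_score:  # accept only strict improvements
            best_prompt, best_score = candidate, score
    return best_prompt, best_score

# Toy demo: this mock "model" only succeeds when the prompt asks for steps.
toy_llm = lambda prompt, spec: spec.upper() if "step" in prompt else spec
tasks = [
    {"spec": "add", "test": lambda out: out == "ADD"},
    {"spec": "sort", "test": lambda out: out == "SORT"},
]
best, score = refine_prompt(toy_llm, "write code", lambda p: p + " step by step", tasks)
```

The greedy accept-if-better step is one simple way to realize the task-performance-guided evaluation and selection the summary describes; the paper's own search and mutation strategies may differ.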
👥 Authors
Sixiang Ye
Beijing University of Chemical Technology, Beijing, China
Zeyu Sun
National Key Laboratory of Space Integrated Information System, Institute of Software, Chinese Academy of Sciences, Beijing, China
Guoqing Wang
Peking University, Beijing, China
Liwei Guo
Beijing University of Chemical Technology, Beijing, China
Qingyuan Liang
Peking University
Zheng Li
Beijing University of Chemical Technology, Beijing, China
Yong Liu
Beijing University of Chemical Technology, Beijing, China