🤖 AI Summary
Problem: Large language models (LLMs) for code generation depend heavily on prompt quality, while manual prompt engineering is inefficient and inconsistent.
Method: This paper introduces Prochemy, presented as the first adaptive prompt alchemy framework designed specifically for code generation. It combines multi-agent collaborative optimization, iterative prompt search driven by model feedback, automated evaluation and selection guided by task performance, and adaptation across models. The resulting optimized prompt keeps inference consistent and supports plug-and-play deployment.
Results: On HumanEval, performance improves over the zero-shot baseline by 5.0% for GPT-3.5-Turbo and 1.9% for GPT-4o; on AVATAR code translation with GPT-4o, the Java-to-Python score rises by 12.9% and the Python-to-Java score by 17.1% in relative terms; and integration with state-of-the-art methods (e.g., LDB) yields consistent gains of 1.2–1.8%. The framework thus advances automated, robust, and generalizable prompt optimization for code-generation LLMs.
📝 Abstract
Code generation has emerged as a key task for automating software development by converting high-level descriptions into executable code. Large language models (LLMs) excel at this but depend heavily on input prompt quality. Manual prompt engineering can be time-consuming and inconsistent, limiting LLM effectiveness. This paper introduces Prochemy, a method for automatically refining prompts to boost code generation. Prochemy overcomes the limitations of manual prompting by automating optimization, ensuring consistency during inference, and supporting multi-agent systems. It iteratively refines prompts based on model performance, then uses an optimized final prompt for improved consistency across tasks. We tested Prochemy on natural language-based code generation and code translation tasks using three LLM series. Results indicate that Prochemy enhances existing methods, improving performance by 5.0% for GPT-3.5-Turbo and 1.9% for GPT-4o over zero-shot baselines on HumanEval. Combined with the state-of-the-art LDB method, Prochemy + LDB surpasses the standalone methods by 1.2-1.8%. For code translation, Prochemy boosts GPT-4o's Java-to-Python (AVATAR) performance from 74.5 to 84.1 (+12.9% relative) and Python-to-Java from 66.8 to 78.2 (+17.1% relative). Moreover, Prochemy maintains strong performance when integrated with the o1-mini model, validating its efficacy in code tasks. Designed as plug-and-play, Prochemy optimizes prompts with minimal human input, bridging the gap between simple prompts and complex frameworks.
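The abstract describes iterative refinement only at a high level: propose prompt variants, score them by model performance, and keep one final prompt. The following Python sketch illustrates that general idea and is not the authors' implementation. The function refine_prompt, the stopping rule (stop when no candidate beats the current best), and the toy toy_propose and toy_score stand-ins are all assumptions made for illustration. In a real setup, the proposal step would call an LLM and the scoring step would run generated code against test cases.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
# a proposer returns prompt variants, a scorer returns the fraction
# of training tasks solved when a given prompt is used.
Proposer = Callable[[str], List[str]]
Scorer = Callable[[str], float]


def refine_prompt(
    seed: str,
    propose: Proposer,
    score: Scorer,
    max_rounds: int = 5,
) -> Tuple[str, float]:
    """Greedy refinement: keep the best-scoring prompt each round and
    stop when no candidate improves on it (assumed stopping rule)."""
    best, best_score = seed, score(seed)
    for _ in range(max_rounds):
        candidates = propose(best)
        if not candidates:
            break  # nothing left to try
        scored = [(score(c), c) for c in candidates]
        top_score, top = max(scored, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # converged: no candidate beats the current prompt
        best, best_score = top, top_score
    return best, best_score


# Toy stand-ins so the sketch runs without any LLM API.
HINTS = ["Think step by step.", "Handle edge cases.", "Return only code."]


def toy_propose(prompt: str) -> List[str]:
    # Append each hint that the prompt does not already contain.
    return [f"{prompt} {h}" for h in HINTS if h not in prompt]


def toy_score(prompt: str) -> float:
    # Pretend every distinct hint solves a further 10% of the tasks.
    return 0.5 + 0.1 * sum(h in prompt for h in HINTS)
```

Calling refine_prompt("Write a Python function.", toy_propose, toy_score) returns the final prompt together with its score. That one prompt would then be reused unchanged for every task at inference time, which is one way to read the abstract's "consistency during inference".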