🤖 AI Summary
Problem: Large code models (LCMs) are highly sensitive to prompt quality, yet prompt design is still largely manual and tied to specific models and tasks, which is a burden for developers working with black-box LCMs. Method: This paper empirically studies two components of automated prompt generation (APG), Instruction Generation (IG) and Multi-Step Reasoning (MSR), and proposes an APG approach that combines the best-performing method for each component. Contribution/Results: Evaluated on four open-source LCMs, the approach achieves average improvements over basic prompts of 28.38% in CodeBLEU (code translation), 58.11% in ROUGE-L (code summarization), and 84.53% in SuccessRate@1 (API recommendation). On WeChat-Bench, a proprietary industrial dataset, it achieves an average MRR improvement of 148.89% for API recommendation.
📝 Abstract
Large Code Models (LCMs) show potential in code intelligence, but their effectiveness is greatly influenced by prompt quality. Current prompt design is mostly manual, which is time-consuming and highly dependent on specific LCMs and tasks. While automated prompt generation (APG) exists in NLP, it is underexplored for code intelligence. This leaves a gap, as automating prompt design is essential for developers facing diverse tasks and black-box LCMs. To address this gap, we empirically investigate two important parts of APG: Instruction Generation (IG) and Multi-Step Reasoning (MSR). IG provides a task-related description to instruct LCMs, while MSR guides them to produce logical steps before the final answer. We evaluate widely used APG methods for each part on four open-source LCMs and three code intelligence tasks: code translation (PL-PL), code summarization (PL-NL), and API recommendation (NL-PL). Experimental results indicate that both IG and MSR dramatically enhance performance compared to basic prompts. Based on these results, we propose a novel APG approach combining the best methods of the two parts. Experiments show our approach achieves average improvements of 28.38% in CodeBLEU (code translation), 58.11% in ROUGE-L (code summarization), and 84.53% in SuccessRate@1 (API recommendation) over basic prompts. To validate its effectiveness in an industrial scenario, we evaluate our approach on WeChat-Bench, a proprietary dataset, achieving an average MRR improvement of 148.89% for API recommendation.
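As a rough illustration of the two APG parts described above, the sketch below assembles a prompt from an optional IG-style instruction and an optional MSR-style reasoning cue, and computes a relative improvement over a baseline score. It is not the paper's implementation: the function names, the prompt layout, the example instruction and cue texts, and the assumption that the reported percentages are relative improvements over basic prompts are all illustrative assumptions.

```python
from typing import Optional


def build_prompt(
    task_input: str,
    instruction: Optional[str] = None,
    reasoning_cue: Optional[str] = None,
) -> str:
    """Assemble a prompt for a black-box code model (hypothetical layout).

    instruction   -- IG-style task description placed before the input.
    reasoning_cue -- MSR-style cue asking for logical steps before the answer.
    With neither set, this returns only the task input (a "basic prompt").
    """
    parts = []
    if instruction:
        parts.append(instruction.strip())
    parts.append(task_input.strip())
    if reasoning_cue:
        parts.append(reasoning_cue.strip())
    return "\n\n".join(parts)


def relative_improvement(baseline: float, improved: float) -> float:
    """Percentage improvement of `improved` over `baseline` (assumed metric definition)."""
    if baseline == 0:
        raise ValueError("baseline score must be non-zero")
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Illustrative texts only; an APG method would generate these automatically.
    prompt = build_prompt(
        "def add(a, b):\n    return a + b",
        instruction="Translate the following Python function to Java.",
        reasoning_cue="First list the steps you will take, then give the final code.",
    )
    print(prompt)
    print(relative_improvement(20.0, 30.0))
```

Keeping the instruction and the reasoning cue as separate optional arguments makes it easy to compare a basic prompt, IG only, MSR only, and the combination on the same input.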