🤖 AI Summary
Prior work lacks systematic, cross-model evaluation of prompt engineering techniques for software engineering (SE) tasks. Method: This study conducts an empirical analysis of 14 prompt engineering techniques across 10 SE tasks—including code generation, bug fixing, and code question answering—spanning six paradigms: zero-shot, few-shot, chain-of-thought (CoT), ensemble, self-critique, and decomposition. Experiments are performed on LLaMA-3, CodeLlama, GPT-4, and Claude-3. Contribution/Results: We introduce the first multi-dimensional, task-aware prompt engineering benchmark framework tailored to SE. Our analysis reveals a strong correlation between task logical complexity and prompt strategy efficacy: CoT and decomposition improve accuracy by +18.7% on high-reasoning tasks, whereas few-shot excels on context-sensitive tasks. We propose a principled prompt selection guideline grounded in linguistic features and resource overhead (latency/token cost), and publicly release a reusable decision table and overhead evaluation toolkit.
📝 Abstract
A growing variety of prompt engineering techniques has been proposed for Large Language Models (LLMs), yet systematic evaluation of each technique on individual software engineering (SE) tasks remains underexplored. In this study, we present a systematic evaluation of 14 established prompt techniques across 10 SE tasks using four LLM models. As identified in the prior literature, the selected prompting techniques span six core dimensions (Zero-Shot, Few-Shot, Thought Generation, Ensembling, Self-Criticism, and Decomposition). They are evaluated on tasks such as code generation, bug fixing, and code-oriented question answering, to name a few. Our results show which prompting techniques are most effective for SE tasks requiring complex logic and intensive reasoning versus those that rely more on contextual understanding and example-driven scenarios. We also analyze correlations between the linguistic characteristics of prompts and the factors that contribute to the effectiveness of prompting techniques in enhancing performance on SE tasks. Additionally, we report the time and token consumption for each prompting technique when applied to a specific task and model, offering guidance for practitioners in selecting the optimal prompting technique for their use cases.