🤖 AI Summary
Large language models (LLMs) exhibit strong performance on general code-generation benchmarks (e.g., HumanEval) but degrade significantly on domain-specific ones (e.g., ParEval), yet the underlying cause, particularly the role of prompt specificity, remains unclear.
Method: We propose PartialOrderEval, a framework that constructs progressively refined prompt sequences to systematically isolate and evaluate the impact of input-output specifications, boundary condition descriptions, and stepwise reasoning on model outputs. Experiments are conducted on Llama-3.x and Qwen2.5-Coder across HumanEval and ParEval (including serial and OpenMP subsets).
Contribution/Results: Prompt specificity substantially improves pass@1 scores, especially on specialized programming tasks, while model sensitivity to prompt details is task-dependent. This work provides the first quantitative characterization of how structured prompt elements interact with domain adaptability, establishing a reproducible methodology for domain-aware prompt engineering.
📝 Abstract
State-of-the-art Large Language Models (LLMs) achieve high pass@1 on general benchmarks like HumanEval but underperform on specialized suites such as ParEval. Is this gap due to LLMs lacking domain knowledge, or to prompts providing insufficient detail? To answer this, we introduce PartialOrderEval, which augments any code-generation benchmark with a partial order of prompts ranging from minimal to maximally detailed. Applying it to HumanEval and to the serial and OpenMP subsets of ParEval, we measure how pass@1 scales with prompt specificity. Our experiments with Llama-3.x and Qwen2.5-Coder demonstrate varying degrees of prompt sensitivity across tasks, and a qualitative analysis identifies explicit I/O specifications, edge-case handling, and stepwise breakdowns as the key drivers of improvement from added prompt detail.
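The partial order of prompts described above can be sketched as a lattice over detail fragments: each prompt variant is the minimal task statement plus some subset of detail dimensions, and subset inclusion defines the ordering. The sketch below is illustrative only; the fragment names and example task are hypothetical, not taken from the paper, though the three detail axes (I/O specification, edge-case handling, stepwise breakdown) mirror the ones the abstract identifies.

```python
from itertools import combinations

# Hypothetical base task and detail fragments (illustrative, not from the paper).
BASE = "Write a function that merges two sorted lists."
DETAILS = {
    "io_spec": "Input: two lists of ints sorted ascending. Output: one sorted list.",
    "edge_cases": "Handle empty inputs and duplicate values.",
    "steps": "Step 1: walk both lists with two pointers. "
             "Step 2: append the smaller head. Step 3: append any remainder.",
}

def prompt_lattice(base, details):
    """Enumerate every subset of detail fragments, yielding
    (subset, prompt) pairs. Subset inclusion gives the partial
    order: prompt A precedes prompt B iff A's detail set is a
    subset of B's, so prompts range from minimal (no details)
    to maximally detailed (all fragments)."""
    keys = sorted(details)
    for r in range(len(keys) + 1):
        for combo in combinations(keys, r):
            text = "\n".join([base] + [details[k] for k in combo])
            yield frozenset(combo), text

lattice = dict(prompt_lattice(BASE, DETAILS))
print(len(lattice))  # 2^3 = 8 prompt variants for 3 detail dimensions
```

Each variant would then be sent to the model under test, and pass@1 plotted against the size (or content) of the detail subset, which is how one could measure the scaling the abstract describes.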