🤖 AI Summary
Large language models (LLMs) exhibit strong performance on general code-generation benchmarks (e.g., HumanEval) but degrade significantly on domain-specific ones (e.g., ParEval), yet the underlying cause, particularly the role of prompt specificity, remains unclear.
Method: We propose PartialOrderEval, a framework that constructs progressively refined prompt sequences to systematically isolate and evaluate the impact of input-output specifications, boundary condition descriptions, and stepwise reasoning on model outputs. Experiments are conducted on Llama-3.x and Qwen2.5-Coder across HumanEval and ParEval (including serial and OpenMP subsets).
Contribution/Results: Prompt specificity substantially improves pass@1 scores, especially on specialized programming tasks, while model sensitivity to prompt details is task-dependent. This work provides the first quantitative characterization of how structured prompt elements interact with domain adaptability, establishing a reproducible methodology for domain-aware prompt engineering.
📝 Abstract
State-of-the-art Large Language Models (LLMs) achieve high pass@1 on general benchmarks like HumanEval but underperform on specialized suites such as ParEval. Is this gap due to LLMs lacking domain knowledge, or to prompts providing insufficient detail? To answer this, we introduce PartialOrderEval, which augments any code-generation benchmark with a partial order of prompts ranging from minimal to maximally detailed. Applying it to HumanEval and to the serial and OpenMP subsets of ParEval, we measure how pass@1 scales with prompt specificity. Our experiments with Llama-3.x and Qwen2.5-Coder demonstrate varying degrees of prompt sensitivity across tasks, and a qualitative analysis identifies explicit I/O specifications, edge-case handling, and stepwise breakdowns as the key drivers of improvement from added prompt detail.
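The partial order of prompts described above can be sketched as a lattice over detail fragments: each prompt variant is the minimal task statement plus some subset of detail dimensions, and subset inclusion defines the ordering. The sketch below is illustrative only; the fragment names and example task are hypothetical, not taken from the paper, though the three detail axes (I/O specification, edge-case handling, stepwise breakdown) mirror the ones the abstract identifies.

```python
from itertools import combinations

# Hypothetical base task and detail fragments (illustrative, not from the paper).
BASE = "Write a function that merges two sorted lists."
DETAILS = {
    "io_spec": "Input: two lists of ints sorted ascending. Output: one sorted list.",
    "edge_cases": "Handle empty inputs and duplicate values.",
    "steps": "Step 1: walk both lists with two pointers. "
             "Step 2: append the smaller head. Step 3: append any remainder.",
}

def prompt_lattice(base, details):
    """Enumerate every subset of detail fragments, yielding
    (subset, prompt) pairs. Subset inclusion gives the partial
    order: prompt A precedes prompt B iff A's detail set is a
    subset of B's, so prompts range from minimal (no details)
    to maximally detailed (all fragments)."""
    keys = sorted(details)
    for r in range(len(keys) + 1):
        for combo in combinations(keys, r):
            text = "\n".join([base] + [details[k] for k in combo])
            yield frozenset(combo), text

lattice = dict(prompt_lattice(BASE, DETAILS))
print(len(lattice))  # 2^3 = 8 prompt variants for 3 detail dimensions
```

Each variant would then be sent to the model under test, and pass@1 plotted against the size (or content) of the detail subset, which is how one could measure the scaling the abstract describes.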