When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions

📅 2025-07-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the robustness of code generation models under ambiguous, contradictory, and incomplete task descriptions, conditions poorly represented in existing benchmarks. Method: To address the lack of realistic prompt imperfections in current evaluation suites, we apply a guided mutation strategy to HumanEval and MBPP, systematically constructing an evaluation dataset featuring three categories of semantic defects. Contribution/Results: We conduct the first empirical analysis of large language models for code across multiple scales and architectures under such defects. Results show that even minor prompt flaws cause substantial drops in functional correctness, especially under contradictory instructions, where models consistently produce logical errors; error patterns also correlate strongly with description clarity. Our findings reveal that state-of-the-art code models are highly sensitive to prompt quality, providing empirical grounding for robustness-aware training and evaluation methodologies.

📝 Abstract
Large Language Models (LLMs) have demonstrated impressive performance in code generation tasks under idealized conditions, where task descriptions are clear and precise. However, in practice, task descriptions frequently exhibit ambiguity, incompleteness, or internal contradictions. In this paper, we present the first empirical study examining the robustness of state-of-the-art code generation models when faced with such unclear task descriptions. We extend the HumanEval and MBPP benchmarks by systematically introducing realistic task description flaws through guided mutation strategies, producing a dataset that mirrors the messiness of informal developer instructions. We evaluate multiple LLMs of varying sizes and architectures, analyzing their functional correctness and failure modes across task description categories. Our findings reveal that even minor imperfections in task description phrasing can cause significant performance degradation, with contradictory task descriptions resulting in numerous logical errors. Moreover, while larger models tend to be more resilient than smaller variants, they are not immune to the challenges posed by unclear requirements. We further analyze semantic error patterns and identify correlations between description clarity, model behavior, and error types. Our results underscore the critical need for developing LLMs that are not only powerful but also robust to the imperfections inherent in natural user tasks, highlighting important considerations for improving model training strategies, designing more realistic evaluation benchmarks, and ensuring reliable deployment in practical software development environments.
Problem

Research questions and friction points this paper is trying to address.

Evaluating code model robustness to ambiguous task descriptions
Assessing performance degradation from incomplete or contradictory prompts
Analyzing error patterns in LLMs under unclear requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Extend benchmarks with flawed task descriptions
Evaluate models on ambiguous, contradictory, and incomplete tasks
Analyze error patterns and model resilience
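The guided-mutation idea can be illustrated with a small sketch. The paper does not publish its mutation rules here, so the function below is a hypothetical illustration of the three defect categories applied to a toy HumanEval-style task description, not the authors' actual strategy:

```python
# Hypothetical sketch: inject one of three semantic defect categories
# (ambiguous, incomplete, contradictory) into a clear task description.
# The ORIGINAL prompt and the rules below are illustrative assumptions.

ORIGINAL = "Return the sum of all even numbers in the list `nums`."

def mutate(description: str, defect: str) -> str:
    """Apply one illustrative defect category to a task description."""
    if defect == "ambiguous":
        # Replace a precise term with a vague one.
        return description.replace("even numbers", "certain numbers")
    if defect == "incomplete":
        # Drop a constraint the intended solution depends on.
        return description.replace(" of all even numbers", "")
    if defect == "contradictory":
        # Append an instruction that conflicts with the original.
        return description + " The result must be a string, not a number."
    raise ValueError(f"unknown defect category: {defect}")

for category in ("ambiguous", "incomplete", "contradictory"):
    print(f"{category}: {mutate(ORIGINAL, category)}")
```

A robustness evaluation would then generate code from each mutated prompt and compare functional correctness (e.g. pass@1 against the original tests) with the unmutated baseline.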