Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

📅 2026-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the dependence of LLM-based code generation on well-formed task specifications: user-provided descriptions often contain defects that degrade the correctness of the generated code. The authors propose SpecValidator, a lightweight classifier built on a small, parameter-efficiently fine-tuned model, which detects three types of specification defects: Lexical Vagueness, Under-Specification, and Syntax-Formatting. SpecValidator also generalizes to unseen issues, detecting previously unknown Under-Specification defects in the benchmarks' original descriptions. It achieves F1 = 0.804 and Matthews Correlation Coefficient (MCC) = 0.745, significantly outperforming GPT-5-mini and Claude Sonnet 4. The findings further indicate that the robustness of code generation depends more on the defect type and the characteristics of the task description than on model capacity.

📝 Abstract

Large language models are widely used for code generation, yet they rely on an implicit assumption that task descriptions are sufficiently detailed and well-formed. In practice, users may provide defective descriptions, which can strongly affect code correctness. To address this issue, we develop SpecValidator, a lightweight classifier based on a small, parameter-efficiently fine-tuned model, to automatically detect task description defects. We evaluate SpecValidator on three types of defects (Lexical Vagueness, Under-Specification, and Syntax-Formatting) across three benchmarks whose task descriptions vary in structure and complexity. Our results show that SpecValidator detects defects with F1 = 0.804 and MCC = 0.745, significantly outperforming GPT-5-mini (F1 = 0.469 and MCC = 0.281) and Claude Sonnet 4 (F1 = 0.518 and MCC = 0.359). Perhaps more importantly, our analysis indicates that SpecValidator can generalize to unseen issues and detect previously unknown Under-Specification defects in the original (real) descriptions of the benchmarks used. Our results also show that the robustness of LLMs to task description defects depends primarily on the type of defect and the characteristics of the task description rather than on the capacity of the model, with Under-Specification defects being the most severe. We further find that benchmarks with richer contextual grounding, such as LiveCodeBench, exhibit substantially greater resilience, highlighting the importance of structured task descriptions for reliable LLM-based code generation.
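The abstract reports detection quality as F1 and MCC. As a reminder of how the two metrics relate, the sketch below computes both from a binary confusion matrix in plain Python. It is an illustration only: the function names and the toy labels are hypothetical, and the summary does not say how the paper aggregates scores across the three defect types, so the binary case shown here is an assumption.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, where 1 means 'defective'."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is zero."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient; defined as 0.0 when any marginal is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy data (hypothetical, not from the paper): 8 task descriptions, 4 defective.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
example_f1 = f1_score(tp, fp, fn)    # 0.75
example_mcc = mcc(tp, fp, fn, tn)    # 0.5
```

Unlike F1, MCC also uses the true negatives, so on the same predictions it can be noticeably lower than F1. This is consistent with the pattern in the reported numbers, where each system's MCC sits below its F1.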

Problem

Research questions and friction points this paper is trying to address.

Defective Task Descriptions

Code Generation

Large Language Models

Under-Specification

Lexical Vagueness

Innovation

Methods, ideas, or system contributions that make the work stand out.

SpecValidator

defective task description

parameter-efficient fine-tuning