🤖 AI Summary
This study addresses the systematic misjudgments of large language models (LLMs) in assessing whether code implementations satisfy natural language requirements, particularly their tendency to erroneously flag correct implementations as non-compliant. Through a unified prompting template and evaluation on mainstream benchmarks, the work systematically examines LLM reliability in requirement-consistency verification and reveals that prompts requesting explanations and corrections exacerbate over-correction behavior. To mitigate this, the authors propose a fix-guided verification filter that leverages model-generated repairs as counterfactual evidence, integrated with specification-aware test augmentation. Experimental results demonstrate that this approach significantly improves judgment accuracy, offering a practical pathway toward reliable deployment of LLMs in safety-critical applications.
📝 Abstract
Large language models (LLMs) have become essential tools in software development, widely used for requirements engineering, code generation, and review tasks. Software engineers often rely on LLMs to verify whether a code implementation satisfies task requirements, thereby ensuring code robustness and accuracy. However, it remains unclear whether LLMs can reliably judge code against given task descriptions, which are usually expressed as natural language specifications. In this paper, we uncover a systematic failure of LLMs in matching code to natural language requirements. Specifically, using widely adopted benchmarks and a unified prompt design, we demonstrate that LLMs frequently misclassify correct code implementations as non-compliant or defective. Surprisingly, we find that more detailed prompt designs, particularly those requiring explanations and proposed corrections, lead to higher misjudgment rates, highlighting critical reliability issues for LLM-based code assistants. We further analyze the mechanisms driving these failures and evaluate the reliability of rationale-required judgments. Building on these findings, we propose a Fix-guided Verification Filter that treats the model-proposed fix as executable counterfactual evidence, and validates the original and revised implementations using benchmark tests and spec-constrained augmented tests. Our results expose previously under-explored limitations in LLM-based code review capabilities, and provide practical guidance for integrating LLM-based reviewers with safeguards in automated review and development pipelines.
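The filtering idea described above can be sketched in a few lines: if the LLM judges an implementation non-compliant and proposes a fix, run both the original and the fix against the available tests; when the original already passes everything and the fix does not outperform it, the rejection is likely an over-correction and can be overridden. This is a minimal illustrative sketch, not the paper's actual implementation; all function names and the scoring rule are assumptions.

```python
# Hypothetical sketch of a fix-guided verification filter. The LLM's proposed
# fix is treated as counterfactual evidence: if the original implementation
# passes all tests and the fix does not strictly improve on it, the model's
# "non-compliant" verdict is overridden. Names and logic are illustrative.

from typing import Callable, List

Test = Callable[[Callable], bool]  # a test takes an implementation, returns pass/fail


def run_tests(impl: Callable, tests: List[Test]) -> int:
    """Count how many tests the implementation passes; crashes count as failures."""
    passed = 0
    for t in tests:
        try:
            if t(impl):
                passed += 1
        except Exception:
            pass
    return passed


def filtered_verdict(original: Callable,
                     llm_says_compliant: bool,
                     proposed_fix: Callable,
                     tests: List[Test]) -> bool:
    """Final compliance verdict after fix-guided filtering."""
    if llm_says_compliant:
        return True  # accept positive judgments as-is
    orig_score = run_tests(original, tests)
    fix_score = run_tests(proposed_fix, tests)
    # Original passes everything and the "fix" is no better: the rejection
    # looks like an over-correction, so override it.
    if orig_score == len(tests) and fix_score <= orig_score:
        return True
    return False
```

For example, a correct absolute-value function wrongly flagged as non-compliant, with a proposed "fix" that breaks positive inputs, would have its verdict overridden because the original passes every test while the fix does not.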