🤖 AI Summary
To address the high assertion error rate (>62%) and low practicality of assertions generated by large language models (LLMs), this paper proposes a novel postcondition-driven assertion verification paradigm that requires neither human annotation nor execution of the target function. Methodologically, it leverages a small set of input-output examples to guide the LLM in generating reliable postconditions, then identifies logically flawed assertions via assertion-postcondition consistency checking. The core contribution is the first formalization of LLM-generated postconditions as verifiable correctness criteria for assertions. Experiments demonstrate that the approach detects over 64% of incorrect assertions on average (63.0% with postconditions from GPT-3.5 and 65.5% with GPT-4), improves code generation by 4% in terms of Pass@1, and preserves the fault-finding ability of correct assertions with only negligible degradation.
📝 Abstract
Recently, given the docstring and the function signature for a target problem, large language models (LLMs) have been used not only to generate source code but also to generate test cases, consisting of test inputs and assertions (e.g., in the form of checking an actual output against the expected output). However, as shown by our empirical study of assertions generated by four LLMs for the HumanEval benchmark, over 62% of the generated assertions are incorrect (i.e., they fail on the ground-truth problem solution). To detect incorrect assertions (given the docstring and the target function signature along with a sample of example inputs and outputs), in this paper we propose a new approach named DeCon that effectively detects incorrect assertions via LLM-generated postconditions for the target problem (a postcondition is a predicate that must always hold immediately after execution of the ground-truth problem solution). Our approach requires only a small set of I/O examples (i.e., a sample of example inputs and outputs) for the target problem, such as the I/O examples included in the docstring of a HumanEval problem. We first use the given I/O examples to filter out those LLM-generated postconditions that are violated by at least one given I/O example. We then use the remaining postconditions to detect incorrect assertions, flagging any assertion that violates at least one remaining postcondition. Experimental results show that DeCon detects on average more than 64% of the incorrect assertions generated by four state-of-the-art LLMs (63% with postconditions generated by GPT-3.5 and 65.5% with GPT-4), and that DeCon also improves the effectiveness of these LLMs in code generation by 4% in terms of Pass@1. In addition, although DeCon might filter out correct assertions, the fault-finding ability of the remaining correct assertions decreases only slightly.
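The two filtering steps described in the abstract can be sketched in code. This is a minimal illustration, not the paper's implementation: the helper names and the toy absolute-value problem are hypothetical, postconditions are represented as plain Python predicates over an (input, output) pair, and the LLM prompting that produces the candidate postconditions and assertions is omitted.

```python
# Hypothetical sketch of DeCon-style filtering (names are illustrative,
# not from the paper). Postconditions are predicates over (input, output).

def filter_postconditions(postconditions, io_examples):
    """Step 1: keep only postconditions satisfied by every given I/O example."""
    return [
        post for post in postconditions
        if all(post(inp, out) for inp, out in io_examples)
    ]

def detect_incorrect_assertions(assertions, postconditions):
    """Step 2: flag assertions (input, expected_output) that violate
    at least one surviving postcondition."""
    return [
        (inp, expected) for inp, expected in assertions
        if any(not post(inp, expected) for post in postconditions)
    ]

# Toy target problem: return the absolute value of x.
io_examples = [(3, 3), (-2, 2)]  # e.g., examples taken from the docstring

# Candidate "LLM-generated" postconditions for the toy problem.
postconditions = [
    lambda x, y: y >= 0,            # holds for the ground-truth solution
    lambda x, y: abs(y) == abs(x),  # holds for the ground-truth solution
    lambda x, y: y == x,            # flawed: violated by the example (-2, 2)
]

# Candidate "LLM-generated" assertions as (test_input, expected_output).
assertions = [(5, 5), (-4, 4), (-1, -1)]  # the last one is incorrect

kept = filter_postconditions(postconditions, io_examples)
flagged = detect_incorrect_assertions(assertions, kept)
print(flagged)  # → [(-1, -1)]
```

In this toy run, the flawed postcondition `y == x` is discarded in step 1 because the example `(-2, 2)` violates it, and the incorrect assertion `(-1, -1)` is then flagged in step 2 because it violates the surviving postcondition `y >= 0`. Note that neither step executes the target function itself, matching the paper's execution-free setting.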