DeCon: Detecting Incorrect Assertions via Postconditions Generated by a Large Language Model

📅 2025-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high assertion error rate (over 62%) and low practicality of assertions generated by large language models (LLMs), this paper proposes a postcondition-driven assertion verification paradigm that requires neither human annotation nor execution of the target function. Methodologically, it leverages a small set of input-output examples to guide the LLM in generating reliable postconditions, then identifies logically flawed assertions via assertion-postcondition consistency checking. The core contribution is the first formalization of LLM-generated postconditions as verifiable correctness criteria for assertions. Experiments show that the approach detects, on average, over 64% of incorrect assertions (63.0% with GPT-3.5 and 65.5% with GPT-4 as the postcondition generator), improves Pass@1 code-generation performance by 4 percentage points, and preserves the fault-finding ability of the remaining correct assertions with only negligible degradation.

📝 Abstract
Recently, given the docstring for the target problem and the target function signature, large language models (LLMs) have been used not only to generate source code, but also to generate test cases consisting of test inputs and assertions (e.g., in the form of checking an actual output against the expected output). However, as shown by our empirical study on assertions generated by four LLMs for the HumanEval benchmark, over 62% of the generated assertions are incorrect (i.e., they fail on the ground-truth problem solution). To detect incorrect assertions (given the docstring and the target function signature along with a sample of example inputs and outputs), in this paper, we propose a new approach named DeCon that effectively detects incorrect assertions via LLM-generated postconditions for the target problem (a postcondition is a predicate that must always hold just after the execution of the ground-truth problem solution). Our approach requires a small set of I/O examples (i.e., a sample of example inputs and outputs) for the target problem (e.g., the I/O examples included in the docstring of a HumanEval problem). We use the given I/O examples to filter out those LLM-generated postconditions that are violated by at least one given I/O example. We then use the remaining postconditions to detect incorrect assertions, namely those assertions that violate at least one remaining postcondition. Experimental results show that DeCon can detect, on average, more than 64% of the incorrect assertions generated by four state-of-the-art LLMs (63% and 65.5% when using GPT-3.5 and GPT-4, respectively), and that DeCon can also improve the effectiveness of these LLMs in code generation by 4% in terms of Pass@1. In addition, although DeCon might filter out correct assertions, the fault-finding ability of the remaining correct assertions decreases only slightly.
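The two filtering steps described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the target problem (sorting a list), the candidate postconditions, and the I/O examples are all hypothetical, and real postconditions would be LLM-generated code rather than hand-written lambdas.

```python
# Illustrative target problem: return the input list sorted ascending.
# A postcondition is a predicate over (input, output) that must hold
# for the ground-truth solution.

# I/O examples, e.g. as found in a HumanEval docstring (hypothetical).
io_examples = [([3, 1, 2], [1, 2, 3]), ([], [])]

# Candidate LLM-generated postconditions; one is deliberately wrong.
postconditions = {
    "output is sorted": lambda inp, out: all(a <= b for a, b in zip(out, out[1:])),
    "output has same length as input": lambda inp, out: len(out) == len(inp),
    "output equals input": lambda inp, out: out == inp,  # wrong postcondition
}

# Step 1: keep only postconditions satisfied by every given I/O example.
surviving = {
    name: p for name, p in postconditions.items()
    if all(p(inp, out) for inp, out in io_examples)
}

# Step 2: flag an assertion (input, expected_output) as incorrect if it
# violates at least one surviving postcondition.
def is_incorrect(inp, expected, posts):
    return any(not p(inp, expected) for p in posts.values())

print("output equals input" in surviving)          # False: filtered by step 1
print(is_incorrect([2, 1], [1, 2], surviving))     # False: consistent assertion
print(is_incorrect([2, 1], [2, 1], surviving))     # True: expected output unsorted
```

Note that neither step executes the target function: both check predicates against given inputs and expected outputs only, which is what lets the approach work without a ground-truth implementation.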
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Test Case Generation
Code Testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

DeCon
Error Detection
Code Generation Accuracy
Hao Yu
Peking University, Beijing, China
Tianyu Chen
Peking University, Beijing, China
Jiaming Huang
Peking University, Beijing, China
Zongyang Li
Peking University, Beijing, China
Dezhi Ran
School of Computer Science, Peking University
Short Video Streaming · Software Testing · Program Analysis
Xinyu Wang
University of Michigan at Ann Arbor, USA
Ying Li
Peking University, Beijing, China
Assaf Marron
Weizmann Institute of Science
Software Engineering · formal methods · computer science · programming · biological modeling
David Harel
Professor of Computer Science, The Weizmann Institute
computer science · systems biology
Yuan Xie
The Hong Kong University of Science and Technology, China
Tao Xie
Key Lab of HCST (PKU), MOE; SCS; Peking University, China