🤖 AI Summary
False negatives in software testing undermine the reliability of code correctness assessment. Method: We propose using error-inducing test cases to make evaluation more robust and introduce Codehacks, a large-scale, real-world adversarial dataset systematically collected from the Codeforces online judge. It comprises 5,578 programming problems, 288,617 error-inducing "hack" test cases, and 2,196 submitted solutions that are broken by their corresponding hacks. Data collection leverages the official API and web crawling, combined with parsing of natural language problem descriptions and alignment of each solution with the hacks that break it. Contribution/Results: Codehacks is the first large-scale, real-world adversarial dataset constructed specifically for evaluating the robustness of program correctness assessment. It addresses the shortage of high-quality empirical adversarial data for assessing LLM-based code generation, strengthening the credibility and practicality of robustness evaluation.
📝 Abstract
Software is used in critical applications in our day-to-day lives, and it is important to ensure its correctness. One popular approach to assessing correctness is to evaluate software on tests. If a test fails, it indicates a fault in the software under test; if all tests pass, one may assume that the software is correct. However, the reliability of these results depends on the test suite considered, and there is a risk of false negatives (i.e., software that passes all available tests but contains bugs because some cases are not tested). Therefore, it is important to consider error-inducing test cases when evaluating software. To support data-driven creation of such a test suite, which is of particular interest for testing software synthesized by large language models, we curate a dataset (Codehacks) of programming problems together with corresponding error-inducing test cases (i.e., "hacks"). This dataset is collected from the wild, in particular from the Codeforces online judge platform. It comprises 288,617 hacks for 5,578 programming problems, each with a natural language description, as well as the source code of 2,196 submitted solutions that can be broken with their corresponding hacks.

Keywords: competitive programming, language model, dataset
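To make the false-negative problem concrete, here is a minimal hypothetical illustration (not drawn from the Codehacks dataset): a submission for a "maximum subarray sum" style problem that passes the sample tests shipped with the problem statement but is exposed by a single hack input.

```python
# Hypothetical buggy submission: it initializes the best sum to 0,
# implicitly allowing the empty subarray. It therefore fails whenever
# every element is negative, even though typical sample tests pass.

def max_subarray_sum(nums):
    best = cur = 0  # bug: should start from the first element, not 0
    for x in nums:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best

# Sample tests from the problem statement: the buggy code passes all of them,
# so a test-suite-only evaluation reports a false negative.
assert max_subarray_sum([1, -2, 3, 4]) == 7
assert max_subarray_sum([2, 2, 2]) == 6

# Hack test case: an all-negative input. A correct solution returns -1;
# the buggy submission returns 0, revealing the latent fault.
print(max_subarray_sum([-1, -3]))
```

A single well-chosen hack like `[-1, -3]` is enough to flip the verdict from "all tests pass" to "faulty", which is exactly the kind of error-inducing input the dataset collects at scale.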