🤖 AI Summary
False negatives in software testing undermine the reliability of code correctness assessment. Method: We propose using error-inducing test cases to make evaluation more robust and introduce Codehacks, a large-scale, real-world adversarial dataset systematically collected from the Codeforces online judge. It comprises 5,578 programming problems, 288,617 error-inducing "hack" test cases, and 2,196 submitted solutions that are broken by their corresponding hacks. Data collection leverages the official API and web crawling, combined with parsing of natural language problem descriptions and alignment of each solution with the hacks that break it. Contribution/Results: Codehacks is the first large-scale, real-world adversarial dataset constructed specifically for evaluating the robustness of program correctness assessment. It addresses the shortage of high-quality empirical adversarial data for assessing LLM-based code generation, strengthening the credibility and practicality of robustness evaluation.
📝 Abstract
Software is used in critical applications in our day-to-day lives, and it is important to ensure its correctness. One popular approach to assessing correctness is to evaluate software on tests. If a test fails, it indicates a fault in the software under test; if all tests pass, one may assume that the software is correct. However, the reliability of these results depends on the test suite considered, and there is a risk of false negatives (i.e., software that passes all available tests but contains bugs because some cases are not tested). Therefore, it is important to consider error-inducing test cases when evaluating software. To support data-driven creation of such a test suite, which is of particular interest for testing software synthesized by large language models, we curate a dataset (Codehacks) of programming problems together with corresponding error-inducing test cases (i.e., "hacks"). This dataset is collected from the wild, in particular from the Codeforces online judge platform. It comprises 288,617 hacks for 5,578 programming problems, each with a natural language description, as well as the source code of 2,196 submitted solutions that can be broken with their corresponding hacks.

Keywords: competitive programming, language model, dataset
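To make the false-negative problem concrete, here is a minimal hypothetical illustration (not drawn from the Codehacks dataset): a submission for a "maximum subarray sum" style problem that passes the sample tests shipped with the problem statement but is exposed by a single hack input.

```python
# Hypothetical buggy submission: it initializes the best sum to 0,
# implicitly allowing the empty subarray. It therefore fails whenever
# every element is negative, even though typical sample tests pass.

def max_subarray_sum(nums):
    best = cur = 0  # bug: should start from the first element, not 0
    for x in nums:
        cur = max(0, cur + x)
        best = max(best, cur)
    return best

# Sample tests from the problem statement: the buggy code passes all of them,
# so a test-suite-only evaluation reports a false negative.
assert max_subarray_sum([1, -2, 3, 4]) == 7
assert max_subarray_sum([2, 2, 2]) == 6

# Hack test case: an all-negative input. A correct solution returns -1;
# the buggy submission returns 0, revealing the latent fault.
print(max_subarray_sum([-1, -3]))
```

A single well-chosen hack like `[-1, -3]` is enough to flip the verdict from "all tests pass" to "faulty", which is exactly the kind of error-inducing input the dataset collects at scale.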