HardTests: Synthesizing High-Quality Test Cases for LLM Coding

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Detecting subtle bugs in LLM-generated code remains challenging, and high-quality test cases, especially for difficult programming problems, are scarce. Method: We propose HARDTESTGEN, the first multi-stage, LLM-driven test synthesis pipeline tailored to hard programming problems. It integrates problem understanding, boundary-scenario mining, counterexample-driven verification, and quality-aware filtering. Using this pipeline, we construct HARDTESTS, a competitive programming test suite covering 47K problems. Contribution/Results: Experiments show that HARDTESTGEN tests improve precision by 11.3 percentage points and recall by 17.5 percentage points when evaluating LLM-generated code; on hard problems, precision gains reach up to 40 percentage points. In reinforcement learning-based code generation training, HARDTESTS significantly outperforms both human-written and existing automatically generated tests, yielding substantial improvements in downstream code generation performance.
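The four-stage flow described above can be sketched in miniature. Everything below is illustrative, not the paper's actual implementation: the function names are hypothetical, and stub functions stand in for LLM calls (input proposal, oracle solution, and a known-wrong solution used for counterexample-driven filtering).

```python
# Hypothetical sketch of a HARDTESTGEN-style pipeline on a toy problem
# ("sum a list of integers"). LLM calls are replaced by hard-coded stubs.

def oracle(xs):
    """Trusted reference solution (stands in for a verified oracle program)."""
    return sum(xs)

def buggy(xs):
    """A plausible wrong solution used for counterexample-driven filtering."""
    return sum(xs[1:]) if xs else 0  # bug: drops the first element

def propose_inputs():
    """Stages 1-2: problem understanding + boundary-scenario mining.
    An LLM would emit these; here we hard-code typical and edge-case inputs."""
    return [[1, 2, 3], [], [0], [-5, 5], [10**9, 1]]

def synthesize_tests():
    """Stages 3-4: label each input with the oracle's output, then keep only
    tests that expose at least one known wrong solution (quality filtering)."""
    labeled = [(x, oracle(x)) for x in propose_inputs()]
    return [(x, y) for (x, y) in labeled if buggy(x) != y]

tests = synthesize_tests()
```

In this toy run, inputs like `[]` and `[0]` are discarded because the buggy solution happens to pass them, while discriminating inputs survive; the real pipeline applies the same idea with LLM-proposed inputs and solutions at far larger scale.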

📝 Abstract
Verifiers play a crucial role in large language model (LLM) reasoning and are required by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to obtain for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully crafted, human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate HARDTESTS, a comprehensive competitive programming dataset with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves more effective for model training, as measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.
Problem

Research questions and friction points this paper is trying to address.

Generating high-quality test cases for LLM coding verifiers
Improving precision and recall in evaluating LLM-generated code
Enhancing model training effectiveness for code generation
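The precision/recall framing above treats a test suite as a binary classifier over candidate solutions: precision is the fraction of accepted solutions that are truly correct, recall the fraction of truly correct solutions that get accepted. A minimal sketch, with an entirely hypothetical toy problem and solutions:

```python
# Measuring a test suite's precision/recall over labeled candidate solutions.
# The toy problem ("return the maximum of a list") and suites are illustrative.

def accepts(solution, tests):
    """A verifier accepts a solution iff it passes every test case."""
    return all(solution(x) == y for x, y in tests)

def precision_recall(tests, labeled):
    """labeled: list of (solution, is_truly_correct) pairs."""
    accepted = [ok for fn, ok in labeled if accepts(fn, tests)]
    tp = sum(accepted)  # truly correct solutions that were accepted
    precision = tp / len(accepted) if accepted else 0.0
    recall = tp / sum(ok for _, ok in labeled)
    return precision, recall

correct    = max
off_by_one = lambda xs: max(xs) - 1  # obviously wrong
sneaky     = lambda xs: xs[-1]       # wrong, but passes any sorted input

labeled = [(correct, True), (off_by_one, False), (sneaky, False)]
weak    = [([1, 2, 3], 3)]                 # only a sorted input
strong  = weak + [([3, 1, 2], 3)]          # adds a discriminating case
```

With the weak suite, the sneaky solution is wrongly accepted (precision drops to 0.5); adding one discriminating test restores precision to 1.0, which is the effect the paper quantifies at scale.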
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes HARDTESTGEN for high-quality test synthesis
Uses LLMs to generate tests with high precision and recall
Curates HARDTESTS dataset with 47k programming problems