🤖 AI Summary
Detecting subtle bugs in LLM-generated code remains challenging, and high-quality test cases for difficult programming problems are scarce. Method: We propose HARDTESTGEN, a multi-stage, LLM-driven test synthesis pipeline tailored to hard programming problems. It integrates problem understanding, boundary-scenario mining, counterexample-driven verification, and quality-aware filtering. Using this pipeline, we construct HARDTESTS, a competitive programming dataset comprising 47K problems with synthesized high-quality tests. Contribution/Results: Experiments show that HARDTESTGEN tests improve precision by 11.3 percentage points and recall by 17.5 percentage points when evaluating LLM-generated code; on hard problems, precision gains reach up to 40 percentage points. For reinforcement learning-based code generation training, HARDTESTS significantly outperforms both human-written tests and existing automated test generation methods, yielding substantial improvements in downstream code generation performance.
📝 Abstract
Verifiers play a crucial role in large language model (LLM) reasoning and are needed by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to obtain for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully crafted, human-written edge cases that are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate HARDTESTS, a comprehensive competitive programming dataset with 47K problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests demonstrate precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 percentage points. HARDTESTS also proves to be more effective for model training, as measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.
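The precision and recall figures refer to how well a test suite's accept/reject verdicts on LLM-generated solutions agree with ground-truth correctness: precision is the fraction of accepted solutions that are truly correct, and recall is the fraction of truly correct solutions that are accepted. A minimal sketch of these standard metrics, assuming boolean verdicts and labels per solution (`verifier_metrics` is a hypothetical helper, not from the paper):

```python
def verifier_metrics(verdicts, labels):
    """verdicts: whether the test suite accepts each solution (True = pass).
    labels:   ground-truth correctness of each solution.
    Returns (precision, recall) of the test suite as a verifier."""
    tp = sum(v and l for v, l in zip(verdicts, labels))          # accepted and correct
    fp = sum(v and not l for v, l in zip(verdicts, labels))      # accepted but wrong
    fn = sum(not v and l for v, l in zip(verdicts, labels))      # rejected but correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Under these definitions, weak tests inflate false positives (well-disguised wrong solutions slip through, hurting precision), which is why hard edge cases matter most on difficult problems.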