🤖 AI Summary
Existing LLM code-generation testing methods rely on static datasets, imposing a “fixed difficulty ceiling” and failing to uncover complex out-of-distribution defects.
Method: We propose a dynamic test-generation framework based on adversarial reinforcement learning, in which a test generator and an adversarial code generator engage in a competitive game. This enables an online, evolving curriculum of test difficulty and self-reinforcing growth in test capability. The framework jointly optimizes two objectives, code correctness and attack success rate, thereby transcending the constraints imposed by static data.
Contribution/Results: Experiments demonstrate that our approach significantly outperforms state-of-the-art baselines on both Best-of-N filtering and reward modeling tasks. To the best of our knowledge, this is the first method to achieve continuous evolution of test strategies together with measurable gains in generalization, marking a fundamental advance in automated, adaptive LLM code testing.
📝 Abstract
Large Language Models (LLMs) excel at code generation, yet their outputs often contain subtle bugs, and effective test cases for exposing them remain a critical bottleneck. Existing test generation methods, whether based on prompting or supervised fine-tuning, rely on static datasets. This imposes a “fixed-difficulty ceiling”, fundamentally limiting their ability to uncover novel or more complex bugs beyond their training scope. To overcome this, we introduce ATGen, a framework that trains a test case generator via adversarial reinforcement learning. ATGen pits a test generator against an adversarial code generator that continuously crafts harder bugs to evade the current policy. This dynamic loop yields a curriculum of steadily increasing difficulty that continually challenges the test generator. The test generator is optimized via Reinforcement Learning (RL) to jointly maximize “Output Accuracy” and “Attack Success”, enabling it to learn a progressively stronger policy that breaks the fixed-difficulty ceiling of static training. Extensive experiments demonstrate that ATGen significantly outperforms state-of-the-art baselines. We further validate its practical utility, showing that it serves both as a more effective filter for Best-of-N inference and as a higher-quality reward source for training code generation models. Our work establishes a new, dynamic paradigm for improving the reliability of LLM-generated code.