🤖 AI Summary
Existing reinforcement learning approaches for code generation are constrained by static, low-coverage test suites and suffer from self-collusion or poor generalization when relying on self-generated tests. This work proposes an adversarial co-evolution framework that jointly optimizes large language models for code generation and test generation: the former aims to pass tests, while the latter seeks to expose defects. By decoupling the two roles into separate models during adversarial training, the framework mitigates self-collusion and enables dynamic, high-quality interaction through white-box targeted test generation, error-aware experience replay, and a composite reward design. Experiments on Qwen2.5-Coder demonstrate that the method matches or even surpasses models supervised by human-written tests in code generation performance, while substantially enhancing test generation capability.
📝 Abstract
Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion, where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.