🤖 AI Summary
This work addresses pseudo-correctness (spurious correctness) in code generation with large language models: solutions pass the static public test cases yet fail on hidden tests. The authors propose AdverMCTS, an adversarial Monte Carlo Tree Search (MCTS) framework that formulates code generation as a minimax-style game between a solver and an attacker. The solver produces candidate programs, while the attacker constructs targeted corner test cases that expose logical flaws in those candidates, so the verification environment becomes stricter as the search proceeds. By coupling adversarial test generation with MCTS, the method actively discovers code vulnerabilities through evolving tests instead of optimizing against a fixed test set, which is the source of overfitting in static validation. The authors report that the method significantly outperforms state-of-the-art baselines, reducing false positive rates and improving generalization to unseen test cases.
📝 Abstract
Recent work on Large Language Models (LLMs) has successfully employed search-based strategies to enhance code generation. However, existing methods typically rely on static, sparse public test cases for verification, leading to pseudo-correctness, where solutions overfit the visible public tests but fail to generalize to hidden test cases. We argue that optimizing against a fixed, weak environment inherently limits robustness. To address this, we propose AdverMCTS, a novel adversarial Monte Carlo Tree Search framework that combats pseudo-correctness by coupling code search with active vulnerability discovery. AdverMCTS formulates generation as a minimax-style game between a Solver agent, which synthesizes code candidates, and an Attacker agent, which evolves to generate targeted corner test cases that exploit logical divergences in the current code pool. These discovered tests form a dynamic, progressively hostile filter that penalizes fragile reasoning. Extensive experiments demonstrate that AdverMCTS significantly outperforms state-of-the-art baselines, effectively reducing false positive rates and forcing the model to generalize beyond the initial constraints. Resources for this work are available at https://anonymous.4open.science/r/AdverMCTS_open-A255.
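To make the idea of tests that "exploit logical divergences in the current code pool" concrete, the following is a minimal illustrative sketch, not the AdverMCTS implementation. It omits the tree search and the LLM agents entirely and substitutes plain differential testing for the Attacker: an input counts as an attack when candidate programs disagree on it. All names (`divergent_inputs`, `adversarial_filter`, `oracle`, the toy candidates) are hypothetical, and the `oracle` that supplies expected outputs is an assumption, since the abstract does not say how attacker tests are labeled.

```python
from typing import Callable, Iterable, List, Tuple

Program = Callable[[int], int]


def divergent_inputs(pool: List[Program], inputs: Iterable[int]) -> List[int]:
    """Attacker stand-in (assumption): keep inputs on which the pool disagrees."""
    return [x for x in inputs if len({p(x) for p in pool}) > 1]


def adversarial_filter(
    pool: List[Program],
    public_tests: List[Tuple[int, int]],
    inputs: Iterable[int],
    oracle: Program,  # hypothetical source of expected outputs
) -> Tuple[List[Program], List[int], List[Program]]:
    # Static check: keep candidates that pass every public test.
    survivors = [p for p in pool if all(p(x) == y for x, y in public_tests)]
    # Adversarial check: probe inputs where the survivors diverge.
    attacks = divergent_inputs(survivors, inputs)
    robust = [p for p in survivors if all(p(x) == oracle(x) for x in attacks)]
    return survivors, attacks, robust


# Toy task: absolute value. The public tests cover only non-negative inputs.
def identity(x: int) -> int:
    return x  # pseudo-correct: passes the public tests, wrong for x < 0


def clamp(x: int) -> int:
    return max(x, 0)  # pseudo-correct: passes the public tests, wrong for x < 0


def abs_correct(x: int) -> int:
    return -x if x < 0 else x


public_tests = [(0, 0), (1, 1), (5, 5)]
survivors, attacks, robust = adversarial_filter(
    [identity, clamp, abs_correct], public_tests, range(-3, 4), abs
)

print(len(survivors))                 # 3
print(attacks)                        # [-3, -2, -1]
print([p.__name__ for p in robust])   # ['abs_correct']
```

In this toy setting all three candidates pass the sparse public tests, and only the inputs on which they diverge separate the correct program from the pseudo-correct ones. That is the failure mode the abstract attributes to static verification.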