Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation

📅 2025-02-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Large language models (LLMs) suffer from error accumulation and degraded generalization in code and test generation when they are trained on low-quality synthetic data. Method: This paper proposes Sol-Ver, a solver-verifier self-play framework in which a single LLM (Llama 3.1 8B) serves concurrently as a solver that generates code and a verifier that automatically generates and executes test cases, forming a closed-loop iterative optimization pipeline. The model verifies its own data, so no human annotations or external teacher models are required. The pipeline integrates dual-role prompt engineering, self-critical test generation, joint fine-tuning, and iterative distillation. Results: On MBPP and LiveCodeBench, the framework achieves average relative improvements of 19.63% in code generation and 17.49% in test generation, easing the synthetic data quality bottleneck.
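
The reported figures are relative rather than absolute gains. The summary does not say how the average is taken, so the sketch below assumes a plain mean of per-benchmark relative gains. The function names and the example scores are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (improved - baseline) / baseline * 100.0


def average_relative_improvement(pairs: List[Tuple[float, float]]) -> float:
    """Mean of per-benchmark relative gains (assumed averaging scheme)."""
    gains = [relative_improvement(base, new) for base, new in pairs]
    return sum(gains) / len(gains)


# Hypothetical (baseline, improved) scores for two benchmarks.
# These are illustrative placeholders, NOT results from the paper.
example_scores = [(40.0, 50.0), (20.0, 22.0)]
avg_gain = average_relative_improvement(example_scores)
print(f"average relative improvement: {avg_gain:.2f}%")
```

With these placeholder scores the per-benchmark gains are 25% and 10%, which shows why a relative figure can look large even when the absolute change on a low-scoring benchmark is small.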

📝 Abstract

Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. Prior work has shown the potential of synthetic self-instruct data, but naively training on a model's own outputs can cause error accumulation, especially in coding tasks, where generalization may collapse due to overly simple or erroneous training data, highlighting the need for rigorous quality checks on synthetic data. In this work, we explore an effective approach whereby the model itself verifies the correctness of its own data. We thus propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity. By iteratively refining code (LLM-as-a-solver) and tests (LLM-as-a-verifier) together, we boost both capabilities without relying on human annotations or larger teacher models. Experiments with the Llama 3.1 8B model demonstrate substantial performance enhancements, achieving average relative improvements of 19.63% in code generation and 17.49% in test generation on MBPP and LiveCodeBench.
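
One round of the solver-verifier loop described in the abstract might be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the filtering rule, so the sketch assumes a sample is kept when the generated code passes all generated tests. The names `run_tests`, `self_play_round`, `toy_solver`, and `toy_verifier` are hypothetical, the two toy functions stand in for calls to the same LLM under two different prompts, and the fine-tuning step that would follow each round is omitted.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str, List[str]]  # (problem, code, tests)


def run_tests(code: str, tests: List[str]) -> bool:
    """Run candidate code, then each assert-style test; True if all pass.

    Uses exec for brevity. Real model output should run in a sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        return False
    return True


def self_play_round(
    problems: List[str],
    generate_code: Callable[[str, int], str],
    generate_tests: Callable[[str], List[str]],
    n_samples: int = 4,
) -> List[Sample]:
    """Keep the first sampled solution per problem that passes the sampled tests."""
    kept: List[Sample] = []
    for problem in problems:
        tests = generate_tests(problem)      # model acting as verifier
        if not tests:
            continue
        for attempt in range(n_samples):
            code = generate_code(problem, attempt)  # model acting as solver
            if run_tests(code, tests):
                kept.append((problem, code, tests))
                break
    return kept


# Toy stand-ins for the two roles of one model (assumptions for illustration).
def toy_solver(problem: str, attempt: int) -> str:
    candidates = [
        "def add(a, b):\n    return a - b",  # buggy first sample
        "def add(a, b):\n    return a + b",  # correct second sample
    ]
    return candidates[attempt % len(candidates)]


def toy_verifier(problem: str) -> List[str]:
    return ["assert add(2, 3) == 5", "assert add(-1, 1) == 0"]


kept = self_play_round(["Write add(a, b) returning the sum."], toy_solver, toy_verifier)
print(len(kept), "verified sample(s) kept for the next fine-tuning round")
```

In an iterative setup, the kept triples would be used to fine-tune the model on both code and test generation before the next round. A filter of this kind rejects code that fails the tests, but it cannot detect cases where the code and the tests share the same mistake.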

Problem

Research questions and friction points this paper is trying to address.

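Gains on coding benchmarks are plateauing as readily available high-quality data runs out

Naively training on a model's own outputs accumulates errors and can collapse generalization

Synthetic code data needs rigorous quality checks without human annotations or larger teacher models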

Innovation

Methods, ideas, or system contributions that make the work stand out.

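Sol-Ver, a self-play solver-verifier framework for code and test generation

A single model verifies the correctness of its own synthetic data

Iterative joint refinement of code (LLM-as-a-solver) and tests (LLM-as-a-verifier)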
