🤖 AI Summary
This work identifies a “verification ceiling” in large language model (LLM) training for code generation: because synthetic validators have limited capability, overly stringent validation criteria erroneously discard high-quality, diverse synthetic data and thereby cap model performance gains. To address this, the authors propose a systematic framework for calibrating verification strategies, comprising (i) construction of diverse, comprehensive test suites; (ii) relaxation of rigid pass/fail thresholds; and (iii) integration of LLM-driven soft validation to complement formal correctness checks. Crucially, this approach preserves functional correctness while substantially improving synthetic data utilization. Experiments show that diversifying test suites yields an average +3.0-point improvement in pass@1, and that relaxed thresholds or LLM-based soft verification recover a further 2–4 points. Human evaluation and controlled comparisons of formally correct versus incorrect solutions jointly confirm that calibrated validation breaks the verification bottleneck, improving both generalization and generation diversity.
📝 Abstract
Large language models for code generation increasingly rely on synthetic data, where both problem solutions and verification tests are generated by models. While this enables scalable data creation, it introduces a previously unexplored bottleneck: the verification ceiling, in which the quality and diversity of training data are fundamentally constrained by the capabilities of synthetic verifiers. In this work, we systematically study how verification design and strategies influence model performance. We investigate (i) what we verify, by analyzing the impact of test complexity and quantity: richer test suites improve code generation capabilities (+3 pass@1 on average), while quantity alone yields diminishing returns; (ii) how we verify, by exploring relaxed pass thresholds: rigid 100% pass criteria can be overly restrictive, and allowing relaxed thresholds or incorporating LLM-based soft verification recovers valuable training data, leading to a 2–4 point improvement in pass@1, although this benefit is contingent on the strength and diversity of the test cases used; and (iii) why verification remains necessary, through controlled comparisons of formally correct versus incorrect solutions and through human evaluation: retaining diverse correct solutions per problem yields consistent generalization gains. Our results show that verification as currently practiced is too rigid, filtering out valuable diversity; yet it cannot be discarded, only recalibrated. By combining calibrated verification with diverse, challenging problem-solution pairs, we outline a path to break the verification ceiling and unlock stronger code generation models.
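The filtering policy the abstract describes can be illustrated with a minimal sketch. This is not the paper's implementation; the names (`Candidate`, `keep_candidate`, `soft_verify`) and the 0.8 threshold are illustrative assumptions, and the LLM judge is replaced by a trivial stub:

```python
from dataclasses import dataclass

# Hypothetical sketch of "calibrated verification": instead of keeping only
# solutions that pass 100% of synthetic tests, keep those whose pass rate
# clears a relaxed threshold, deferring borderline cases to a soft verifier.

@dataclass
class Candidate:
    solution: str
    tests_passed: int
    tests_total: int

def pass_rate(c: Candidate) -> float:
    return c.tests_passed / c.tests_total if c.tests_total else 0.0

def soft_verify(c: Candidate) -> bool:
    # Placeholder for an LLM-based judge; here, a stub that only
    # accepts near-misses rather than querying a model.
    return pass_rate(c) >= 0.5

def keep_candidate(c: Candidate, threshold: float = 0.8) -> bool:
    # Rigid filtering would require pass_rate(c) == 1.0; a relaxed
    # threshold recovers high-quality solutions that fail a few brittle
    # synthetic tests.
    if pass_rate(c) >= threshold:
        return True
    return soft_verify(c)

pool = [
    Candidate("sol_a", 10, 10),  # perfect: kept under any policy
    Candidate("sol_b", 9, 10),   # fails one test: dropped by rigid 100%, kept here
    Candidate("sol_c", 3, 10),   # mostly wrong: dropped by both checks
]
kept = [c.solution for c in pool if keep_candidate(c)]
print(kept)  # → ['sol_a', 'sol_b']
```

Under a rigid 100% criterion only `sol_a` survives; the relaxed policy also retains `sol_b`, which is the kind of recovered diversity the paper credits for the 2–4 point pass@1 gain.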