ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

📅 2026-04-04
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the challenge that existing large language model–based methods for generating test cases often produce potentially flawed tests, making test quality difficult to assess and creating a circular dependency between test validity and code correctness. To resolve this, the authors propose the ACES scoring framework, which reframes test evaluation as a ranking consistency problem. ACES introduces leave-one-out AUC (LOO-AUC) to measure a test's ability to discriminate between correct and incorrect code without requiring absolute judgments of test correctness. Operating on a binary pass-fail matrix, ACES enables efficient weighted ranking via either closed-form weight computation or differentiable optimization, yielding two scalable variants: ACES-C and ACES-O. Experiments demonstrate that ACES significantly improves Pass@k performance across multiple code generation benchmarks, achieving state-of-the-art results with minimal computational overhead.
๐Ÿ“ Abstract
Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
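The leave-one-out construction in the abstract can be sketched directly on the binary pass matrix. The snippet below is a minimal illustration, not the paper's implementation: it assumes the aggregate score for each code is a plain mean pass rate over the remaining tests (ACES weights tests; the exact weights are not given here) and counts tied score pairs as half in the AUC.

```python
import numpy as np

def loo_auc(P):
    """Per-test leave-one-out AUC on a binary pass matrix.

    P: (n_codes, n_tests) array, P[i, j] = 1 if code i passes test j.
    For each test j: rank codes by their mean pass rate over the other
    tests, then compute the AUC of that ranking against test j's own
    pass/fail labels. Uninformative or degenerate tests stay at 0.5.
    """
    n_codes, n_tests = P.shape
    aucs = np.full(n_tests, 0.5)
    for j in range(n_tests):
        rest = np.delete(P, j, axis=1)       # hold out test j
        scores = rest.mean(axis=1)           # aggregate score from remaining tests
        labels = P[:, j]                     # held-out test's pass/fail pattern
        pos, neg = scores[labels == 1], scores[labels == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue                         # test passes or fails everything
        # AUC = P(score_pos > score_neg) + 0.5 * P(tie), over all pos/neg pairs
        gt = (pos[:, None] > neg[None, :]).mean()
        ties = (pos[:, None] == neg[None, :]).mean()
        aucs[j] = gt + 0.5 * ties
    return aucs
```

On a toy matrix where two tests agree with the majority and one contradicts it, the agreeing tests score above 0.5 while the uninformative one lands at exactly 0.5, matching the intuition that LOO-AUC rewards discriminative tests.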
Problem

Research questions and friction points this paper is trying to address.

code generation
test reliability
circular dependency
LLM-generated tests
code evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LOO-AUC
test weighting
code generation
circular dependency
ACES
Hui Sun
Nanjing University, State Key Laboratory for Novel Software Technology, China
Deep Learning, Transfer Learning, Domain Adaptation, Semi-supervised Learning
Yun-Ji Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Zheng Xie
Nanjing University
Machine Learning
Ren-Biao Liu
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Yali Du
Turing Fellow, Associate Professor, King's College London
Multi-Agent Reinforcement Learning, Human-AI Coordination, Alignment, Cooperative AI
Xin-Ye Li
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Ming Li
Nanjing University
Machine Learning, Data Mining