ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

2026-04-04

AI Summary
This work addresses the challenge that existing large language model-based methods for generating test cases often produce flawed tests, making test quality difficult to assess and creating a circular dependency between test validity and code correctness. To resolve this, the authors propose the ACES scoring framework, which reframes test evaluation as a ranking consistency problem. ACES introduces leave-one-out AUC (LOO-AUC) to measure a test's ability to discriminate between correct and incorrect code without requiring absolute judgments of test correctness. Operating on a binary pass-fail matrix, ACES enables efficient weighted ranking via either closed-form weight computation or differentiable optimization, yielding two scalable variants: ACES-C and ACES-O. Experiments demonstrate that ACES significantly improves Pass@$k$ performance across multiple code generation benchmarks, achieving state-of-the-art results with minimal computational overhead.
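One way to write the leave-one-out agreement down, as a sketch only: the notation below ($B$, $s_i^{(-j)}$, $P_j$, $F_j$) is assumed for illustration and is not taken from the paper, and the aggregate score uses plain unweighted pass counts, which is the simplest choice rather than necessarily the one ACES uses.

```latex
% Assumed notation (illustrative, not the paper's):
%   B \in \{0,1\}^{n \times m} is the pass matrix, B_{ij} = 1 iff code i passes test j.
\[
s_i^{(-j)} = \sum_{k \neq j} B_{ik}, \qquad
P_j = \{\, i : B_{ij} = 1 \,\}, \qquad
F_j = \{\, i : B_{ij} = 0 \,\}
\]
\[
\mathrm{LOO\text{-}AUC}_j
  = \frac{1}{|P_j|\,|F_j|}
    \sum_{i \in P_j} \sum_{i' \in F_j}
    \Bigl( \mathbf{1}\bigl[ s_i^{(-j)} > s_{i'}^{(-j)} \bigr]
         + \tfrac{1}{2}\, \mathbf{1}\bigl[ s_i^{(-j)} = s_{i'}^{(-j)} \bigr] \Bigr)
\]
```

Under this reading, a value near 1 means the codes passing test $j$ are the ones the other tests also rank highly, a value near 0.5 means test $j$ carries no ranking signal, and a value near 0 means it contradicts the remaining tests.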

Abstract
Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@$k$ on multiple code generation benchmarks.
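The leave-one-out procedure described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `loo_auc` and `rank_codes`, the toy pass matrix, the unweighted aggregate score, and the weight rule `max(AUC - 0.5, 0)` are all hypothetical choices made here; the paper's ACES-C closed-form weights and ACES-O objective are not reproduced.

```python
import numpy as np


def loo_auc(passes):
    """Leave-one-out AUC per test.

    passes[i, j] = 1 if code candidate i passes test j, else 0.
    Assumption: the aggregate score is the plain pass count on the other tests.
    """
    passes = np.asarray(passes, dtype=float)
    n_tests = passes.shape[1]
    totals = passes.sum(axis=1)
    aucs = np.full(n_tests, 0.5)  # 0.5 = uninformative default
    for j in range(n_tests):
        scores = totals - passes[:, j]       # score on all tests except j
        pos = scores[passes[:, j] == 1]      # codes that pass the held-out test
        neg = scores[passes[:, j] == 0]      # codes that fail it
        if pos.size == 0 or neg.size == 0:
            continue                         # test j cannot separate anything
        diff = pos[:, None] - neg[None, :]   # all (passing, failing) pairs
        aucs[j] = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size
    return aucs


def rank_codes(passes, aucs):
    """Weighted vote per code. The weight rule is an illustrative assumption."""
    weights = np.clip(aucs - 0.5, 0.0, None)
    return np.asarray(passes, dtype=float) @ weights


# Toy pass matrix (hypothetical data): 4 code candidates x 4 tests.
# Test 3 is passed mostly by the codes that the other tests rank lowest.
passes = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
])
aucs = loo_auc(passes)
scores = rank_codes(passes, aucs)
best = int(np.argmax(scores))
print(aucs, scores, best)
```

In this toy case the held-out ranking disagrees completely with test 3, so its LOO-AUC is 0 and it contributes nothing to the weighted vote, while plain counting would have given it the same say as every other test.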
Problem

Research questions and friction points this paper is trying to address.

code generation
test reliability
circular dependency
LLM-generated tests
code evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

LOO-AUC
test weighting
code generation
circular dependency
ACES
Similar Papers
Hui Sun
Nanjing University, State Key Laboratory for Novel Software Technology, China
Deep Learning, Transfer Learning, Domain Adaptation, Semi-supervised Learning
Yun-Ji Zhang
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Zheng Xie
Nanjing University
Machine Learning
Ren-Biao Liu
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Yali Du
Turing Fellow, Associate Professor, King's College London
Multi-Agent Reinforcement Learning, Human-AI Coordination, Alignment, Cooperative AI
Xin-Ye Li
National Key Laboratory for Novel Software Technology, Nanjing University, China; School of Artificial Intelligence, Nanjing University, China
Ming Li
Nanjing University
Machine Learning, Data Mining