🤖 AI Summary
Existing code selection methods rely on large language models (LLMs) to independently assess individual programs, leaving them vulnerable to LLM output errors and misjudgments of functional equivalence. This work proposes a pairwise query framework grounded in exact learning, introducing two novel query types (pairwise membership queries and pairwise equivalence queries) to construct a robust tournament-based selection algorithm. Unlike prior approaches, it does not assume the LLM is reliable; instead, it improves selection accuracy through tournament-based pairwise program comparisons that tolerate some oracle mistakes. Evaluated on four mainstream code generation benchmarks, the method achieves an average 13.0% absolute improvement in pass@1, with gains up to 27.1%. On complex reasoning tasks, it boosts the success rate by 24.0%, substantially outperforming state-of-the-art methods. The framework thus provides a principled, robust alternative to independent LLM judgments for program selection.
📝 Abstract
Despite recent advances in LLMs, the task of code generation is still challenging. To mitigate this, code selection algorithms select the best program from multiple programs generated by an LLM. However, existing algorithms can fail to identify the correct program, either because they misidentify nonequivalent programs as equivalent or because they rely on an LLM and assume it always correctly determines the output for every input. We present ExPairT-LLM, an exact learning algorithm for code selection that selects a program by posing to an LLM oracle two new types of queries: pairwise membership and pairwise equivalence. These queries are simpler for LLMs and enable ExPairT-LLM to identify the correct program through a tournament, which is robust to some LLM mistakes. We evaluate ExPairT-LLM on four popular code datasets. Its pass@1 (success rate) outperforms the state-of-the-art code selection algorithm on average by +13.0% and up to +27.1%. It also improves the pass@1 of LLMs performing complex reasoning by +24.0%.
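To make the tournament idea concrete, here is a minimal sketch of pairwise tournament selection. The `judge` callable is a hypothetical stand-in for the LLM oracle answering the paper's pairwise membership/equivalence queries; the actual query design and any tie-breaking or robustness machinery in ExPairT-LLM are not reproduced here.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")

def tournament_select(programs: List[T], judge: Callable[[T, T], T]) -> T:
    """Single-elimination tournament over candidate programs.

    judge(p, q) returns whichever of the two candidates it prefers
    (in ExPairT-LLM this role is played by an LLM oracle answering
    pairwise queries; here it is an arbitrary comparator).
    """
    if not programs:
        raise ValueError("need at least one candidate program")
    candidates = list(programs)
    while len(candidates) > 1:
        winners = []
        # Pit candidates against each other in pairs; each round
        # halves the field, so only O(n) pairwise queries are needed.
        for i in range(0, len(candidates) - 1, 2):
            winners.append(judge(candidates[i], candidates[i + 1]))
        if len(candidates) % 2 == 1:
            winners.append(candidates[-1])  # odd one out gets a bye
        candidates = winners
    return candidates[0]

# Toy usage: with max() as the judge, the tournament finds the maximum.
best = tournament_select([3, 1, 4, 1, 5], max)
```

A single-elimination bracket asks only O(n) pairwise questions for n candidates, and a wrong answer eliminates at most one correct candidate per round, which is the robustness property the abstract alludes to.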