Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput

📅 2025-06-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the speed–accuracy trade-off in code-generation verification for large language models (LLMs). The authors propose a generate-prune-then-rank framework: a lightweight outcome reward model (ORM) serves as a coarse-grained verifier that prunes candidate solutions it confidently scores as incorrect, and the remaining candidates are then ranked. To the authors' knowledge, this is the first systematic study establishing the utility of ORMs for code verification, challenging the prevailing paradigm that prioritizes full test-suite execution whenever possible. Empirically, the approach achieves an 11.65× throughput speedup over exhaustive testing while being only 8.33% less accurate, substantially improving the scalability and practicality of large-scale code verification. The core contribution is demonstrating that ORM-guided pruning filters out incorrect but highly ranked candidates, offering an efficient and reliable pathway for LLM-based code verification.

📝 Abstract
The standard paradigm for solving coding tasks via large language models (LLMs) is to generate-then-rank programs, where the latter step uses a verifier in the ranking process. The growing consensus is that a comprehensive verifier (e.g., a full test suite) should be prioritized over an outcome reward model (ORM) whenever possible, with little consideration given to the trade-offs involved. We aim to challenge this assumption by systematically exploring the tradeoff between speed and accuracy. We find that ORMs play a crucial role in scaling verification through trading accuracy for speed, even when a comprehensive verifier is available. Their value becomes especially apparent when used in a generate-prune-then-rank approach, where a faster but less accurate verifier removes incorrect solutions prior to ranking -- leading to a system that is 11.65x faster while only being 8.33% less accurate than the full test suite. We analyze the generate-prune-then-rank approach and show that it works by filtering out incorrect but highly ranked solutions. These findings enable the design of scalable and accurate program ranking systems.
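The generate-prune-then-rank pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `orm_score` and `full_test_suite_score` functions and the `prune_threshold` value are hypothetical stand-ins for the paper's actual models, test suites, and settings.

```python
def orm_score(program: str) -> float:
    """Hypothetical outcome reward model: a fast, coarse estimate of
    P(program is correct). Stubbed with a trivial heuristic here."""
    return 0.9 if "return" in program else 0.1

def full_test_suite_score(program: str) -> float:
    """Hypothetical comprehensive verifier (e.g., running the full test
    suite). Accurate but slow; stubbed here for illustration."""
    return 1.0 if "a + b" in program else 0.0

def generate_prune_then_rank(candidates, prune_threshold=0.5):
    # Prune: the cheap ORM removes candidates it is confident are
    # incorrect, so the expensive verifier runs on fewer programs.
    survivors = [p for p in candidates if orm_score(p) >= prune_threshold]
    # Rank: the accurate (but slow) verifier orders what remains.
    return sorted(survivors, key=full_test_suite_score, reverse=True)

# Toy candidates, as if sampled from an LLM for "add two numbers".
candidates = [
    "def add(a, b): return a + b",
    "def add(a, b): return a - b",
    "def add(a, b): pass",
]
ranked = generate_prune_then_rank(candidates)
print(ranked[0])  # the correct implementation ranks first
```

The key design point the paper analyzes is the pruning step: a less accurate verifier is acceptable there because its job is only to discard confidently wrong candidates before the costly ranking pass, which is where the 11.65× speedup comes from.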
Problem

Research questions and friction points this paper is trying to address.

Exploring speed-accuracy tradeoff in code verification
Evaluating reward models versus comprehensive verifiers
Optimizing generate-prune-then-rank for scalable verification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses outcome reward models for verification
Trades accuracy for speed in verification
Implements generate-prune-then-rank approach