🤖 AI Summary
Existing process reward models (PRMs) for code generation suffer from unnatural step decomposition and noisy Monte Carlo intermediate labels. To address these issues, we propose a novel PRM paradigm that treats function calls as atomic reasoning units and employs Chain-of-Function prompting to enable modular, interpretable code generation. We further introduce a two-level meta-learning label-correction mechanism that leverages final-state unit-test outcomes to retroactively refine intermediate-step rewards, mitigating label noise. Our approach integrates PRM-based reward modeling, MAML-inspired bi-level optimization, and test-time scaling. Evaluated on LiveCodeBench, our method achieves 80.9% pass@1, surpassing OpenAI's o4-mini and establishing a new state of the art for PRMs in code generation.
📝 Abstract
Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps, using a Chain-of-Function prompting strategy to induce modular code generation and enable PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applied to test-time scaling, DreamPRM-Code achieves state-of-the-art performance on LiveCodeBench with an 80.9% pass@1 rate, surpassing OpenAI o4-mini.
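The bi-level correction idea can be illustrated with a deliberately tiny sketch. This is not the paper's implementation: the one-parameter linear "PRM", the additive label-correction offset `lam`, and the synthetic data are all assumptions made for illustration. The structure, however, mirrors the described mechanism: an inner gradient step fits the model to noisy Monte-Carlo-style intermediate labels shifted by a learnable correction, and an outer (meta) step differentiates through that inner update, MAML-style, to tune the correction against clean final unit-test labels. Gradients are written analytically since the model is linear.

```python
# Toy bi-level label correction (illustrative only, not the paper's code).
xs = [0.2, 0.4, 0.6, 0.8, 1.0]        # hypothetical per-step features
noisy = [2 * x - 0.5 for x in xs]      # MC-style labels with systematic -0.5 bias
clean = [2 * x for x in xs]            # clean labels from final unit tests

alpha, beta = 0.5, 0.5                 # inner / outer learning rates
w, lam = 0.0, 0.0                      # PRM weight, learnable label correction
n = len(xs)

for _ in range(200):
    # Inner step: one gradient step fitting corrected labels noisy[i] + lam.
    g_w = sum(2 * (w * x - (y + lam)) * x for x, y in zip(xs, noisy)) / n
    w_adapt = w - alpha * g_w
    # d(w_adapt)/d(lam): analytic, since labels enter the inner loss linearly.
    dw_dlam = alpha * 2 * sum(xs) / n
    # Outer step: meta loss on clean labels, chain rule through w_adapt.
    g_lam = sum(2 * (w_adapt * x - y) * x for x, y in zip(xs, clean)) / n * dw_dlam
    w, lam = w_adapt, lam - beta * g_lam

print(round(lam, 3), round(w, 3))      # lam ≈ 0.5 recovers the -0.5 label bias
```

In this sketch the outer loop learns `lam ≈ 0.5`, exactly cancelling the injected bias, after which the inner loop fits the true relation (`w ≈ 2.0`). In the actual method the "correction" acts on intermediate-step reward labels of a neural PRM rather than a scalar offset, but the nested optimization has the same shape.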