🤖 AI Summary
Existing process reward models (PRMs) for code generation suffer from unnatural step decomposition and noisy Monte Carlo intermediate labels. To address these issues, we propose a novel PRM paradigm that treats function calls as atomic reasoning units and employs Chain-of-Function prompting to enable modular, interpretable code generation. We further introduce a two-level meta-learning label-correction mechanism that leverages final-state unit-test outcomes to retroactively refine intermediate-step rewards, mitigating label noise. Our approach integrates PRM-based reward modeling, MAML-inspired bi-level optimization, and test-time scaling. Evaluated on LiveCodeBench, our method achieves 80.9% pass@1, surpassing OpenAI's o4-mini and establishing a new state of the art for PRMs in code generation.
📝 Abstract
Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited due to the lack of meaningful step decompositions in code and the noise of Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps, using a Chain-of-Function prompting strategy to induce modular code generation and enable PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applied to test-time scaling, DreamPRM-Code achieves state-of-the-art performance on LiveCodeBench with an 80.9% pass@1 rate, surpassing OpenAI o4-mini.
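The bi-level correction idea can be illustrated with a deliberately tiny sketch. This is not the paper's implementation: the one-parameter linear "PRM", the additive label-correction offset `lam`, and the synthetic data are all assumptions made for illustration. The structure, however, mirrors the described mechanism: an inner gradient step fits the model to noisy Monte-Carlo-style intermediate labels shifted by a learnable correction, and an outer (meta) step differentiates through that inner update, MAML-style, to tune the correction against clean final unit-test labels. Gradients are written analytically since the model is linear.

```python
# Toy bi-level label correction (illustrative only, not the paper's code).
xs = [0.2, 0.4, 0.6, 0.8, 1.0]        # hypothetical per-step features
noisy = [2 * x - 0.5 for x in xs]      # MC-style labels with systematic -0.5 bias
clean = [2 * x for x in xs]            # clean labels from final unit tests

alpha, beta = 0.5, 0.5                 # inner / outer learning rates
w, lam = 0.0, 0.0                      # PRM weight, learnable label correction
n = len(xs)

for _ in range(200):
    # Inner step: one gradient step fitting corrected labels noisy[i] + lam.
    g_w = sum(2 * (w * x - (y + lam)) * x for x, y in zip(xs, noisy)) / n
    w_adapt = w - alpha * g_w
    # d(w_adapt)/d(lam): analytic, since labels enter the inner loss linearly.
    dw_dlam = alpha * 2 * sum(xs) / n
    # Outer step: meta loss on clean labels, chain rule through w_adapt.
    g_lam = sum(2 * (w_adapt * x - y) * x for x, y in zip(xs, clean)) / n * dw_dlam
    w, lam = w_adapt, lam - beta * g_lam

print(round(lam, 3), round(w, 3))      # lam ≈ 0.5 recovers the -0.5 label bias
```

In this sketch the outer loop learns `lam ≈ 0.5`, exactly cancelling the injected bias, after which the inner loop fits the true relation (`w ≈ 2.0`). In the actual method the "correction" acts on intermediate-step reward labels of a neural PRM rather than a scalar offset, but the nested optimization has the same shape.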