DreamPRM-Code: Function-as-Step Process Reward Model with Label Correction for LLM Coding

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing process reward models (PRMs) for code generation suffer from unnatural step decomposition and noisy Monte Carlo intermediate labels. To address these issues, we propose a novel PRM paradigm that treats function calls as atomic reasoning units and employs Chain-of-Function prompting to enable modular, interpretable code generation. We further introduce a two-level meta-learning label-correction mechanism that leverages final-state unit-test outcomes to retroactively refine intermediate-step rewards, mitigating label noise. Our approach integrates PRM-based reward modeling, MAML-inspired bi-level optimization, and test-time scaling. Evaluated on LiveCodeBench, our method achieves 80.9% pass@1, surpassing OpenAI's o4-mini and establishing a new state of the art for PRMs in code generation.
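The function-as-step idea can be sketched in a few lines: split a candidate program into its top-level functions and score each one as an atomic reasoning unit. The snippet below is a toy illustration, not the paper's implementation; `split_into_function_steps`, `score_solution`, and the length-based reward are assumptions made for the sketch (a real PRM would supply the per-step scores).

```python
import ast

def split_into_function_steps(source: str):
    """Treat each top-level function in generated code as one
    reasoning 'step', mirroring the function-as-step decomposition."""
    tree = ast.parse(source)
    return [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    ]

def score_solution(source: str, step_reward) -> float:
    """Aggregate per-function rewards into a solution score
    (minimum over steps, one common PRM aggregation choice)."""
    steps = split_into_function_steps(source)
    return min(step_reward(s) for s in steps) if steps else 0.0

candidate = '''
def parse_input(line):
    return list(map(int, line.split()))

def solve(nums):
    return max(nums) - min(nums)
'''

# Hypothetical stand-in for a learned PRM: longer functions score higher.
toy_reward = lambda step: min(1.0, len(step) / 80)
print(score_solution(candidate, toy_reward))
```

With per-step scores in hand, test-time scaling is straightforward: generate many candidates, score each with the PRM, and keep the best-ranked one.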

📝 Abstract
Process Reward Models (PRMs) have become essential for improving Large Language Models (LLMs) via test-time scaling, yet their effectiveness in coding remains limited by the lack of meaningful step decompositions in code and the noise in Monte-Carlo-generated partial labels. We propose DreamPRM-Code, a coding-focused PRM that treats functions as reasoning steps, using a Chain-of-Function prompting strategy to induce modular code generation and enable PRM training and application analogous to mathematical reasoning tasks. To address label noise, DreamPRM-Code introduces a meta-learning-based correction mechanism that leverages clean final-solution unit-test labels and performs bi-level optimization to refine intermediate labels. Applied to test-time scaling, DreamPRM-Code achieves state-of-the-art performance on LiveCodeBench with an 80.9% pass@1 rate, surpassing OpenAI o4-mini.
Problem

Research questions and friction points this paper is trying to address.

Code lacks natural step decompositions for process reward modeling
Monte-Carlo-generated intermediate labels are noisy
Existing PRMs underperform on code generation tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Chain-of-Function prompting for modular code steps
Meta-learning label correction with bi-level optimization
Process reward model training analogous to math reasoning
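The bi-level label correction above can be made concrete: an inner loop trains the PRM on (re-weighted) noisy intermediate labels, and an outer loop adjusts the correction so that the resulting model fits the clean final-solution unit-test labels. Below is a minimal NumPy sketch under strong simplifying assumptions: a logistic regression stands in for the PRM, per-example weights stand in for label correction, and finite-difference hypergradients replace the paper's MAML-style meta-gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "intermediate step" training set with noisy Monte Carlo labels:
# roughly 30% of the binary labels are flipped relative to ground truth.
X_train = rng.normal(size=(40, 3))
true_w = np.array([1.5, -2.0, 0.5])
y_true = (X_train @ true_w > 0).astype(float)
flip = rng.random(40) < 0.3
y_noisy = np.where(flip, 1 - y_true, y_true)

# Clean meta set, playing the role of final-solution unit-test outcomes.
X_meta = rng.normal(size=(20, 3))
y_meta = (X_meta @ true_w > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def inner_train(sample_weights, lr=0.5, steps=50):
    # Inner level: fit a tiny logistic-regression "PRM" on the noisy
    # labels; the per-example weights act as the label-correction knob.
    w = np.zeros(3)
    for _ in range(steps):
        p = sigmoid(X_train @ w)
        w -= lr * X_train.T @ (sample_weights * (p - y_noisy)) / len(y_noisy)
    return w

def meta_loss(sample_weights):
    # Outer level: cross-entropy of the inner-trained model on the
    # clean unit-test labels.
    w = inner_train(sample_weights)
    p = np.clip(sigmoid(X_meta @ w), 1e-6, 1 - 1e-6)
    return -np.mean(y_meta * np.log(p) + (1 - y_meta) * np.log(1 - p))

weights = np.ones(40)
base = meta_loss(weights)
eps = 1e-3
for _ in range(10):
    # Finite-difference hypergradient of the meta loss w.r.t. weights.
    current = meta_loss(weights)
    grad = np.array([
        (meta_loss(weights + eps * np.eye(40)[i]) - current) / eps
        for i in range(40)
    ])
    # Backtracking step: accept only updates that reduce the meta loss.
    for lr in (5.0, 1.0, 0.2):
        cand = np.clip(weights - lr * grad, 0.0, 2.0)
        if meta_loss(cand) < current:
            weights = cand
            break

print(f"meta loss with uniform weights:   {base:.3f}")
print(f"meta loss after label correction: {meta_loss(weights):.3f}")
```

Examples whose noisy labels conflict with the clean unit-test signal tend to be down-weighted, which is the intuition behind using final outcomes to retroactively correct intermediate-step rewards.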
Ruiyi Zhang
University of California, San Diego
Peijia Qin
University of California, San Diego
Qi Cao
University of California, San Diego
Pengtao Xie
Associate Professor, UC San Diego; Adjunct Faculty, MBZUAI
Machine Learning