Efficient Process Reward Model Training via Active Learning

📅 2025-04-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high annotation cost, whether human- or LLM-based, of training Process Reward Models (PRMs), this paper proposes ActPRM, an active learning framework for PRM training. Methodologically, ActPRM introduces the first active learning paradigm tailored to PRMs, featuring a forward-inference-based trajectory uncertainty estimation mechanism that invokes strong reasoning models only for highly uncertain trajectories; it further establishes a collaborative annotation scheme in which a lightweight PRM and a strong reasoning model jointly produce process-level reward labels. Experimentally, ActPRM achieves state-of-the-art performance on ProcessBench (75.0%) and PRMBench (65.5%), matching or surpassing full fine-tuning with only 50% of the annotations. This efficiency enables scalable, million-scale trajectory filtering, significantly reducing annotation overhead while preserving reward-modeling fidelity.
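The paper does not spell out its exact uncertainty measure in this summary; one plausible instantiation of forward-inference-based trajectory uncertainty is the predictive entropy of the PRM's per-step correctness probabilities, aggregated over the trajectory. The sketch below (function names and the max-aggregation choice are assumptions, not the paper's definitions) illustrates the idea:

```python
import math

def step_entropy(p_correct: float) -> float:
    """Binary entropy of a step-level correctness probability (in nats)."""
    if p_correct in (0.0, 1.0):
        return 0.0
    return -(p_correct * math.log(p_correct)
             + (1 - p_correct) * math.log(1 - p_correct))

def trajectory_uncertainty(step_probs: list[float]) -> float:
    """Aggregate per-step uncertainty; here the max over steps, so a single
    ambiguous step is enough to flag the whole trajectory for annotation."""
    return max(step_entropy(p) for p in step_probs)

# A confident trajectory vs. one with an ambiguous middle step:
confident = [0.99, 0.97, 0.98]
ambiguous = [0.95, 0.55, 0.90]
assert trajectory_uncertainty(ambiguous) > trajectory_uncertainty(confident)
```

With such a score, the expensive reasoning model is only called on trajectories above an uncertainty threshold, which is what makes the selective annotation scheme cheap.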

📝 Abstract
Process Reward Models (PRMs) provide step-level supervision to large language models (LLMs), but scaling up training data annotation remains challenging for both humans and LLMs. To address this limitation, we propose an active learning approach, ActPRM, which proactively selects the most uncertain samples for training, substantially reducing labeling costs. During training, we use the PRM to estimate uncertainty after the forward pass, retaining only highly uncertain data. A capable yet costly reasoning model then labels this data. We then compute the loss with respect to the labels and update the PRM's weights. We compare ActPRM with vanilla fine-tuning in a pool-based active learning setting, demonstrating that ActPRM reduces annotation by 50% while achieving comparable or even better performance. Beyond annotation efficiency, we further advance the actively trained PRM by filtering over 1M math reasoning trajectories with ActPRM, retaining 60% of the data. Subsequent training on this filtered dataset yields a new state-of-the-art (SOTA) PRM on ProcessBench (75.0%) and PRMBench (65.5%) compared with same-sized models.
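The training loop described in the abstract (forward-pass uncertainty, filter, label with a strong model, update) can be sketched as follows. Everything here is a toy stand-in: `ToyPRM`, its mean-score uncertainty heuristic, and `label_fn` are hypothetical stubs, not the paper's actual model or labeler.

```python
class ToyPRM:
    """Toy stand-in for a process reward model (the real PRM is a fine-tuned
    LLM); scores trajectories of step probabilities and counts weight updates."""
    def __init__(self):
        self.updates = 0

    def uncertainty(self, trajectory: list[float]) -> float:
        # High when the mean step score sits near 0.5, low near 0 or 1.
        mean = sum(trajectory) / len(trajectory)
        return 1.0 - abs(2 * mean - 1)

    def train_step(self, labeled_batch):
        # Placeholder for computing the loss against labels and updating weights.
        self.updates += len(labeled_batch)

def active_round(prm, label_fn, pool, threshold):
    """One ActPRM-style round over an unlabeled pool:
    1. Forward pass: score every trajectory with the cheap PRM.
    2. Retain only highly uncertain trajectories.
    3. Call the costly strong reasoner only on the retained subset.
    4. Update the PRM on the newly labeled batch."""
    uncertain = [t for t in pool if prm.uncertainty(t) > threshold]
    labeled = [(t, label_fn(t)) for t in uncertain]
    prm.train_step(labeled)
    return len(labeled), len(pool)
```

In a pool-based setting this round repeats over the pool; the annotation saving is the gap between `len(labeled)` and `len(pool)` accumulated across rounds.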
Problem

Research questions and friction points this paper is trying to address.

Reducing annotation costs for Process Reward Models (PRMs) via active learning
Improving PRM training efficiency by selecting uncertain samples
Achieving state-of-the-art PRM performance with filtered math reasoning data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Forward-pass uncertainty estimation selects the most informative trajectories for annotation
Collaborative labeling by a lightweight PRM and a strong reasoning model halves annotation cost
Million-scale trajectory filtering yields state-of-the-art results on ProcessBench and PRMBench