Adversarial Training for Process Reward Models

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing process reward models (PRMs) rely on costly human-provided step-level annotations and exhibit poor generalization to out-of-distribution (OOD) reasoning errors. To address these limitations, we propose the Adversarial Process Reward Model (APRM), a generator–reward-model adversarial training framework. The generator actively synthesizes progressively challenging negative samples—i.e., syntactically valid yet logically flawed reasoning chains—to compel the reward model to improve its robustness against novel logical fallacies, entirely without human annotation. APRM enables end-to-end iterative optimization via step-level supervisory signals. On mathematical reasoning benchmarks, APRM achieves an average accuracy gain of 3.4 percentage points and improves OOD generalization by 5.3 percentage points. Moreover, it significantly enhances cross-task robustness and scalability, demonstrating strong potential for deployment in diverse reasoning-intensive applications.

📝 Abstract
Process Reward Models (PRMs) enhance the reasoning ability of LLMs by providing step-level supervision. However, their widespread adoption is limited by expensive manual step-level annotation and the poor generalization of static training data to novel errors. We introduce Adversarially Trained PRMs (APRM), where a Generator ($G$) learns to produce reasoning errors to deceive a PRM ($R$), while $R$ concurrently learns to detect them. This interaction yields progressively harder negatives for $R$, improving its robustness and generalization to novel errors without requiring manual step-level labels. Averaged across diverse mathematical reasoning benchmarks, APRM improves solver accuracy by $+3.4$ percentage points (pp) over the strongest PRM baseline. APRM achieves gains of $+5.3$ pp on out-of-distribution tasks.
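The adversarial loop described in the abstract can be sketched in miniature. This is a hypothetical toy, not the paper's implementation: "reasoning steps" are scalars, the reward model $R$ is a simple threshold classifier, and the generator $G$ is a single offset parameter that controls how plausible its flawed steps look. The names and update rules are illustrative assumptions; the point is only the alternation — $G$ probes $R$, $R$ hardens against the negatives that fooled it, and $G$ responds with subtler negatives.

```python
import random

random.seed(0)

# Toy sketch of the APRM-style adversarial loop (hypothetical
# simplification, not the paper's actual models or objectives).
# Correct "steps" sit near 1.0; G emits flawed steps at 1.0 - offset.

def reward_model(step, threshold):
    """R accepts a step as correct iff its value exceeds the threshold."""
    return step > threshold

def generate_negative(offset):
    """G emits a flawed step: plausible-looking but shifted down by `offset`."""
    return 1.0 - offset + random.gauss(0, 0.01)

def adversarial_round(threshold, offset, lr=0.05):
    negatives = [generate_negative(offset) for _ in range(32)]
    fooled = [n for n in negatives if reward_model(n, threshold)]
    if fooled:
        # R adapts: move the decision boundary past the hardest negative.
        threshold = max(fooled) + lr
    else:
        # G adapts: shrink the offset so its negatives look more plausible,
        # i.e. become progressively harder for R.
        offset = max(offset * 0.5, 0.01)
    return threshold, offset

threshold, offset = 0.5, 0.4
for _ in range(20):
    threshold, offset = adversarial_round(threshold, offset)
```

After a few rounds the threshold has risen well above its starting point and the generator's offset has shrunk, mirroring the claimed dynamic: no human-labeled negatives are involved, yet $R$ ends up rejecting near-miss errors it initially accepted.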
Problem

Research questions and friction points this paper is trying to address.

Enhance reasoning ability with step-level supervision
Reduce manual annotation cost for process reward models
Improve generalization to novel errors in reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial training between generator and reward model
Automatically generates hard negatives without manual labels
Improves generalization to novel errors and distribution shifts
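To make the "step-level supervision" idea above concrete, here is a small sketch of how a trained PRM's per-step scores are commonly used at inference time: re-rank N candidate reasoning chains by aggregating each chain's step scores and keep the best one. The aggregation (sum of log-scores) and the scores themselves are illustrative assumptions, not values from the paper.

```python
import math

def chain_score(step_scores):
    """Aggregate step-level PRM scores into one chain-level score
    (sum of logs, i.e. the log of the product of step scores)."""
    return sum(math.log(s) for s in step_scores)

# Hypothetical per-step PRM scores for three candidate solutions.
candidates = {
    "chain_a": [0.9, 0.8, 0.95],   # one weak middle step
    "chain_b": [0.9, 0.97, 0.95],  # uniformly strong steps
    "chain_c": [0.99, 0.3, 0.99],  # a single flawed step sinks the chain
}

best = max(candidates, key=lambda c: chain_score(candidates[c]))
```

Note how chain_c is rejected despite two near-perfect steps: step-level scoring penalizes a single logical flaw that an outcome-only reward might miss, which is why hardening the PRM against subtle flawed steps translates into solver accuracy gains.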