Adversarial Training for Process Reward Models

📅 2025-11-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing process reward models (PRMs) rely on costly human-provided step-level annotations and exhibit poor generalization to out-of-distribution (OOD) reasoning errors. To address these limitations, we propose the Adversarial Process Reward Model (APRM), a generator–reward-model adversarial training framework. The generator actively synthesizes progressively challenging negative samples—i.e., syntactically valid yet logically flawed reasoning chains—to compel the reward model to improve its robustness against novel logical fallacies, entirely without human annotation. APRM enables end-to-end iterative optimization via step-level supervisory signals. On mathematical reasoning benchmarks, APRM achieves an average accuracy gain of 3.4 percentage points and improves OOD generalization by 5.3 percentage points. Moreover, it significantly enhances cross-task robustness and scalability, demonstrating strong potential for deployment in diverse reasoning-intensive applications.

📝 Abstract
Process Reward Models (PRMs) enhance the reasoning ability of LLMs by providing step-level supervision. However, their widespread adoption is limited by expensive manual step-level annotation and the poor generalization of static training data to novel errors. We introduce Adversarially Trained PRMs (APRM), where a Generator ($G$) learns to produce reasoning errors to deceive a PRM ($R$), while $R$ concurrently learns to detect them. This interaction yields progressively harder negatives for $R$, improving its robustness and generalization to novel errors without requiring manual step-level labels. Averaged across diverse mathematical reasoning benchmarks, APRM improves solver accuracy by $+3.4$ percentage points (pp) over the strongest PRM baseline. APRM achieves gains of $+5.3$ pp on out-of-distribution tasks.
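The adversarial loop described in the abstract can be sketched in miniature. This is a hypothetical toy, not the paper's implementation: "reasoning steps" are scalars, the reward model $R$ is a simple threshold classifier, and the generator $G$ is a single offset parameter that controls how plausible its flawed steps look. The names and update rules are illustrative assumptions; the point is only the alternation — $G$ probes $R$, $R$ hardens against the negatives that fooled it, and $G$ responds with subtler negatives.

```python
import random

random.seed(0)

# Toy sketch of the APRM-style adversarial loop (hypothetical
# simplification, not the paper's actual models or objectives).
# Correct "steps" sit near 1.0; G emits flawed steps at 1.0 - offset.

def reward_model(step, threshold):
    """R accepts a step as correct iff its value exceeds the threshold."""
    return step > threshold

def generate_negative(offset):
    """G emits a flawed step: plausible-looking but shifted down by `offset`."""
    return 1.0 - offset + random.gauss(0, 0.01)

def adversarial_round(threshold, offset, lr=0.05):
    negatives = [generate_negative(offset) for _ in range(32)]
    fooled = [n for n in negatives if reward_model(n, threshold)]
    if fooled:
        # R adapts: move the decision boundary past the hardest negative.
        threshold = max(fooled) + lr
    else:
        # G adapts: shrink the offset so its negatives look more plausible,
        # i.e. become progressively harder for R.
        offset = max(offset * 0.5, 0.01)
    return threshold, offset

threshold, offset = 0.5, 0.4
for _ in range(20):
    threshold, offset = adversarial_round(threshold, offset)
```

After a few rounds the threshold has risen well above its starting point and the generator's offset has shrunk, mirroring the claimed dynamic: no human-labeled negatives are involved, yet $R$ ends up rejecting near-miss errors it initially accepted.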
Problem

Research questions and friction points this paper is trying to address.

Enhance reasoning ability with step-level supervision
Reduce manual annotation cost for process reward models
Improve generalization to novel errors in reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial training between generator and reward model
Automatically generates hard negatives without manual labels
Improves generalization to novel errors and distribution shifts
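To make the "step-level supervision" idea above concrete, here is a small sketch of how a trained PRM's per-step scores are commonly used at inference time: re-rank N candidate reasoning chains by aggregating each chain's step scores and keep the best one. The aggregation (sum of log-scores) and the scores themselves are illustrative assumptions, not values from the paper.

```python
import math

def chain_score(step_scores):
    """Aggregate step-level PRM scores into one chain-level score
    (sum of logs, i.e. the log of the product of step scores)."""
    return sum(math.log(s) for s in step_scores)

# Hypothetical per-step PRM scores for three candidate solutions.
candidates = {
    "chain_a": [0.9, 0.8, 0.95],   # one weak middle step
    "chain_b": [0.9, 0.97, 0.95],  # uniformly strong steps
    "chain_c": [0.99, 0.3, 0.99],  # a single flawed step sinks the chain
}

best = max(candidates, key=lambda c: chain_score(candidates[c]))
```

Note how chain_c is rejected despite two near-perfect steps: step-level scoring penalizes a single logical flaw that an outcome-only reward might miss, which is why hardening the PRM against subtle flawed steps translates into solver accuracy gains.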