AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Evaluating complex reasoning processes in large language models (LLMs) remains challenging due to high annotation costs and limited human labeling accuracy. To address this, we propose AURORA, an automated framework for training universal process reward models (PRMs) that evaluate diverse policy distributions and long chain-of-thought (CoT) outputs. Methodologically, we introduce a two-stage automated training pipeline: (1) multi-prompt strategy integration with ensemble-based labeling, which makes process-level assessment more robust; and (2) reference-answer-guided reverse verification, which improves the fidelity of training labels. Technically, our approach unifies process-level supervised learning, ensemble prompt engineering, construction of the UniversalBench benchmark, and systematic evaluation of long CoT trajectories. Experiments demonstrate that the resulting PRM substantially improves evaluation accuracy across heterogeneous reasoning policies and extended inference paths. We publicly release the Universal-PRM-7B model and the UniversalBench benchmark.
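To make the first stage concrete, here is a minimal Python sketch of the ensemble-labeling idea: each reasoning step is judged under several prompt templates, and the votes are aggregated by majority. `query_llm`, the templates, and the sample counts are illustrative placeholders, not AURORA's actual prompts or judge model.

```python
import random
from collections import Counter

# Hypothetical stand-in for an LLM judge call; AURORA's actual judge
# model and prompt templates are not specified here.
def query_llm(prompt: str) -> str:
    """Return '+' (step correct) or '-' (step incorrect)."""
    return random.choice(["+", "-"])  # placeholder for a real API call

# Illustrative prompt strategies; the paper's templates may differ.
PROMPT_TEMPLATES = [
    "Check this reasoning step for logical validity:\n{step}\nAnswer '+' or '-'.",
    "Does this step follow from the problem and prior steps?\n{step}\nAnswer '+' or '-'.",
    "Verify the arithmetic or derivation in this step:\n{step}\nAnswer '+' or '-'.",
]

def ensemble_label(step: str, n_samples: int = 3) -> str:
    """Label one reasoning step by majority vote across diverse prompts."""
    votes = []
    for template in PROMPT_TEMPLATES:
        for _ in range(n_samples):
            votes.append(query_llm(template.format(step=step)))
    return Counter(votes).most_common(1)[0][0]

def label_trajectory(steps: list[str]) -> list[str]:
    """Produce process-level labels for a full chain-of-thought."""
    return [ensemble_label(s) for s in steps]
```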

📝 Abstract
The reasoning capabilities of advanced large language models (LLMs) like o1 have revolutionized artificial intelligence applications. Nevertheless, evaluating and optimizing complex reasoning processes remain significant challenges due to diverse policy distributions and the inherent limitations of human effort and accuracy. In this paper, we present AURORA, a novel automated framework for training universal process reward models (PRMs) using ensemble prompting and reverse verification. The framework employs a two-phase approach: first, it uses diverse prompting strategies and ensemble methods to perform automated annotation and evaluation of processes, ensuring robust assessments for reward learning; second, it leverages practical reference answers for reverse verification, enhancing the model's ability to validate outputs and improving training accuracy. To assess the framework's performance, we extend beyond the existing ProcessBench benchmark by introducing UniversalBench, which evaluates reward predictions across full trajectories under diverse policy distributions with long Chain-of-Thought (CoT) outputs. Experimental results demonstrate that AURORA enhances process evaluation accuracy and improves PRMs' accuracy on diverse policy distributions and long-CoT responses. The project will be open-sourced at https://auroraprm.github.io/. The Universal-PRM-7B model is available at https://huggingface.co/infly/Universal-PRM-7B.
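The reverse-verification phase can be read as a consistency check between forward step labels and the known reference answer. The sketch below shows one plausible reading, not the paper's implementation: trajectories whose step labels contradict the reference-answer outcome are discarded or re-annotated rather than used as training targets. The function names and the simple string match are assumptions.

```python
# A minimal sketch of reference-answer-guided reverse verification,
# assuming access to the problem's reference answer. AURORA's concrete
# prompts and answer-matching logic may differ.
def answers_match(predicted: str, reference: str) -> bool:
    """Loose string match; a real system would normalize math expressions."""
    return predicted.strip().lower() == reference.strip().lower()

def reverse_verify(step_labels, final_answer, reference_answer):
    """Reconcile forward step labels with the known reference answer.

    A trajectory that reaches the correct final answer should have no
    step flagged as wrong, and vice versa; conflicting labels are
    withheld from training instead of being used as targets.
    """
    answer_correct = answers_match(final_answer, reference_answer)
    all_steps_positive = all(label == "+" for label in step_labels)
    if answer_correct == all_steps_positive:
        return step_labels  # consistent: keep labels for training
    return None             # inconsistent: discard or re-annotate
```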
Problem

Research questions and friction points this paper is trying to address.

Automated training of reward models
Ensemble prompting for robust evaluations
Reverse verification for output validation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated annotation via ensemble prompting
Reverse verification using reference answers
UniversalBench for diverse policy evaluation (see the evaluation sketch below)
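For intuition on how such a benchmark scores a PRM, the sketch below implements a ProcessBench/UniversalBench-style metric: the model must locate the first erroneous step in a trajectory, or declare all steps correct, and accuracy is the fraction of trajectories where the predicted index matches the gold label. The field names and the 0.5 threshold are illustrative, not the benchmark's actual schema.

```python
# A minimal sketch of a first-error-localization metric for PRMs.
def first_error_index(step_scores, threshold=0.5):
    """Index of the first step the PRM scores below threshold, else -1."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return -1  # -1 means the PRM judges every step correct

def benchmark_accuracy(examples):
    """examples: dicts with PRM 'step_scores' and gold 'error_index'."""
    hits = sum(
        first_error_index(ex["step_scores"]) == ex["error_index"]
        for ex in examples
    )
    return hits / len(examples)

# Example: the PRM flags step 2 as the first error, matching the gold label.
demo = [{"step_scores": [0.9, 0.8, 0.3, 0.7], "error_index": 2}]
print(benchmark_accuracy(demo))  # 1.0
```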
👥 Authors
Xiaoyu Tan
INFLY TECH (Shanghai) Co., Ltd., Shanghai, China
Tianchu Yao
INFLY TECH (Shanghai) Co., Ltd., Shanghai, China
Chao Qu
INFLY TECH (Shanghai) Co., Ltd., Shanghai, China
Bin Li
Shanghai University of Engineering Science, Shanghai, China
Minghao Yang
Fudan University, Shanghai, China
Dakuan Lu
INFLY TECH (Shanghai) Co., Ltd., Shanghai, China
Haozhe Wang
INFLY TECH (Shanghai) Co., Ltd., Shanghai, China
Xihe Qiu
Associate Professor, Shanghai University of Engineering Science
Research interests: AI for Healthcare, Vision-Language Models, Reinforcement Learning, Large Language Models
Wei Chu
INFLY TECH (Shanghai) Co., Ltd., Shanghai, China
Yinghui Xu
Research Scientist / Senior Director
Research interests: machine learning, machine vision, optimization
Yuan Qi
Fudan University, Shanghai, China