🤖 AI Summary
The prevailing assumption that process reward models (PRMs) outperform outcome reward models (ORMs) lacks rigorous cross-domain validation. Method: We systematically evaluate four reward model variants, discriminative ORM (DisORM), discriminative PRM (DisPRM), generative ORM (GenORM), and generative PRM (GenPRM), across 14 diverse domains. GenORM, a generative outcome verification framework, sidesteps two key limitations of PRMs: annotation noise from auto-labeling and stepwise error accumulation over long reasoning chains. Contribution/Results: GenORM achieves consistent, statistically significant improvements across all 14 domains, demonstrating superior cross-domain reliability, while PRMs show no systematic advantage because their fine-grained stepwise supervision is fragile on long reasoning trajectories. We establish the first large-scale, unified, multi-domain benchmark for reward modeling and publicly release code, data, and models to move the field from task-specific toward general-purpose reward modeling.
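To make the ORM/PRM distinction concrete, here is a minimal sketch of the two scoring schemes. The function names, the `min` aggregation for the PRM, and the `scorer` interface are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, List

# Illustrative scorer interface (assumption): maps text to a score in [0, 1].
Scorer = Callable[[str], float]

def orm_score(problem: str, steps: List[str], scorer: Scorer) -> float:
    """Outcome reward model: judge only the final answer."""
    return scorer(problem + "\n" + steps[-1])

def prm_score(problem: str, steps: List[str], scorer: Scorer) -> float:
    """Process reward model: score every intermediate step, then aggregate.

    `min` is one common aggregation choice; it is an assumption here,
    not necessarily the rule used in the paper.
    """
    prefix = problem
    step_scores = []
    for step in steps:
        prefix += "\n" + step
        step_scores.append(scorer(prefix))
    return min(step_scores)
```

A discriminative variant would implement `scorer` as a classifier head over the text, while a generative variant would prompt an LLM to emit a verdict and map it to a score; either way, the ORM makes one judgment per trajectory while the PRM makes one per step.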
📝 Abstract
The reliability of large language models (LLMs) during test-time scaling is often assessed with *external verifiers* or *reward models* that distinguish correct reasoning from flawed logic. Prior work generally assumes that process reward models (PRMs), which score every intermediate reasoning step, outperform outcome reward models (ORMs), which assess only the final answer. This view rests mainly on evidence from narrow, math-adjacent domains. We present the first unified evaluation of four reward model variants, discriminative ORM and PRM (DisORM, DisPRM) and generative ORM and PRM (GenORM, GenPRM), across 14 diverse domains. Contrary to conventional wisdom, we find that (i) DisORM performs on par with DisPRM, (ii) GenPRM is not competitive, and (iii) overall, GenORM is the most robust, yielding significant and consistent gains across every tested domain. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories, including those involving self-correcting reasoning. Our theoretical analysis shows that step-wise aggregation compounds errors as reasoning length grows, and our empirical observations confirm this effect. These findings challenge the prevailing assumption that fine-grained supervision is always better and support generative outcome verification for multi-domain deployment. We publicly release our code, datasets, and checkpoints at https://github.com/db-Lee/Multi-RM to facilitate future research in multi-domain settings.
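The error-compounding argument in the abstract can be illustrated with a back-of-the-envelope calculation. Assuming (for illustration only, not as the paper's model) that each step-level judgment is independently correct with probability 1 - eps, a verifier that must get all T steps right succeeds with probability (1 - eps)^T, which decays exponentially as trajectories get longer:

```python
# Illustrative assumption: step-level judgments are independent, each with
# error rate eps. The trajectory-level judgment is correct only if every
# step-level judgment is correct.
def p_all_steps_correct(eps: float, T: int) -> float:
    return (1.0 - eps) ** T

for T in (5, 20, 50):
    print(f"T={T:2d}  P(all correct)={p_all_steps_correct(eps=0.05, T=T):.3f}")
# T= 5  P(all correct)=0.774
# T=20  P(all correct)=0.358
# T=50  P(all correct)=0.077
```

An outcome-level verifier makes a single judgment per trajectory, so under the same per-judgment error rate its reliability does not degrade with T, which is the intuition behind the abstract's claim.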