AI Summary
Supervising the logical validity of intermediate reasoning steps in multi-step inference remains challenging due to the difficulty of obtaining reliable, fine-grained step-level feedback. Method: This paper proposes a generative judge model that reformulates step-level reward modeling as an interpretable meta-reasoning task. Instead of relying on static annotations or black-box scoring, it employs a reinforcement learning framework that optimizes a generative judgment policy via relative rollout outcomes, producing fine-grained, process-aware step evaluation tokens. Contribution/Results: To our knowledge, this is the first work to cast judging as a generative reasoning task, enabling traceable criteria and fully interpretable judgments. Moreover, it supports online policy optimization and accelerated inference search. Experiments demonstrate significant improvements over existing baselines in intermediate-step accuracy, while also enhancing final answer quality and search efficiency.
Abstract
As models increasingly leverage multi-step reasoning strategies to solve complex problems, supervising the logical validity of these intermediate steps has become a critical research challenge. Process reward models address this by providing step-by-step feedback, but current approaches have two major drawbacks: they typically function as classifiers without providing explanations, and their reliance on supervised fine-tuning with static datasets limits generalization. Inspired by recent advances, we reframe stepwise reward modeling from a classification task into a reasoning task itself. We thus propose a generative judge that reasons about the policy model's reasoning steps (i.e., meta-reasons), outputting thinking tokens before delivering a final verdict. Our model, StepWiser, is trained by reinforcement learning using relative outcomes of rollouts. We show that it (i) achieves better judgment accuracy on intermediate steps than existing methods; (ii) can be used to improve the policy model at training time; and (iii) improves inference-time search.
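The "relative outcomes of rollouts" idea can be sketched as follows. This is a minimal illustration, not the paper's actual training code: the function names (`rollout_success_rate`, `label_step`) and the specific labeling rule (a step is positive if Monte Carlo rollouts continued from after the step succeed at least as often as rollouts from before it) are assumptions made for clarity.

```python
def rollout_success_rate(policy, prefix, num_rollouts=8):
    """Estimate P(correct final answer | reasoning prefix) by sampling
    completions from the policy and checking final-answer correctness.
    Here `policy(prefix)` is assumed to return 1 on success, 0 otherwise."""
    wins = sum(policy(prefix) for _ in range(num_rollouts))
    return wins / num_rollouts

def label_step(policy, prefix, step, num_rollouts=8):
    """Assign a step-level label from relative rollout outcomes:
    a step is labeled positive (1) if appending it does not reduce
    the estimated chance of reaching a correct final answer."""
    q_before = rollout_success_rate(policy, prefix, num_rollouts)
    q_after = rollout_success_rate(policy, prefix + [step], num_rollouts)
    return 1 if q_after >= q_before else 0
```

Labels produced this way could then serve as the reward signal when training the generative judge, whose chain-of-thought verdict on each step is scored against them.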