Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing generative reward models, which typically rely on unstructured expansion of reasoning length and overlook the distinct advantages of breadth-oriented (B-CoT) and depth-oriented (D-CoT) reasoning across diverse tasks. To this end, the authors propose Mix-GRM, a framework that systematically disentangles and jointly optimizes the B-CoT and D-CoT mechanisms for the first time. The approach reconstructs original rationales through modular composition and employs a hybrid optimization strategy combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR). Experiments demonstrate complementary gains in both subjective preference alignment and objective correctness, with RL guiding the model to adaptively allocate reasoning styles. Mix-GRM achieves new state-of-the-art results across five benchmarks, outperforming the best open-source reward models by an average of 8.2%, and all data, models, and code are released to support reproducibility.

📝 Abstract
Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released on Hugging Face (https://huggingface.co/collections/DonJoey/mix-grm), and the code is released on GitHub (https://github.com/Don-Joey/Mix-GRM).
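The RLVR stage described above requires a reward that can be checked mechanically rather than learned. A minimal sketch of such a check, assuming (hypothetically) that the generative reward model ends its rationale with a bracketed verdict tag naming the preferred response; the paper's actual verdict format and reward shaping are not specified on this page:

```python
import re

def verifiable_reward(judgment: str, gold: str) -> float:
    """Binary verifiable reward for a GRM's pairwise judgment.

    Hypothetical format: the model's B-CoT or D-CoT rationale ends with a
    verdict tag such as "[[A]]" or "[[B]]" naming the preferred response.
    The reward is 1.0 iff the verdict matches the gold preference label.
    """
    match = re.search(r"\[\[([AB])\]\]", judgment)
    if match is None:
        # An unparseable verdict earns no reward.
        return 0.0
    return 1.0 if match.group(1) == gold else 0.0

# Example: a depth-oriented rationale ending with the correct verdict.
cot = "Response A mishandles the edge case; Response B is correct. [[B]]"
print(verifiable_reward(cot, "B"))  # 1.0
print(verifiable_reward(cot, "A"))  # 0.0
```

Because the reward depends only on agreement with a ground-truth label, it is verifiable in the RLVR sense: no learned judge is needed, and the policy is free to reach the verdict via breadth- or depth-style reasoning.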
Problem

Research questions and friction points this paper is trying to address.

Generative Reward Models
Chain-of-Thought
Breadth-CoT
Depth-CoT
reasoning mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative Reward Models
Chain-of-Thought
Breadth-Depth Synergy
Reinforcement Learning with Verifiable Rewards
Modular Synthesis