🤖 AI Summary
Existing LLM post-training for reasoning relies on binary verifiers (0/1 correctness signals), resulting in sparse rewards and an inability to distinguish partially correct or semantically equivalent answers, thereby limiting reasoning capability gains. To address this, we propose HERO, a hybrid reinforcement learning framework that, for the first time, structurally integrates deterministic verifier signals with continuous reward-model scores. HERO introduces hierarchical normalization and variance-aware weighting to refine quality discrimination while preserving correctness guarantees. This design jointly leverages the stability of verifiers and the fine-grained discriminability of reward models. Experiments demonstrate that HERO significantly outperforms both verifier-only and reward-model-only baselines across multiple mathematical reasoning benchmarks, especially on tasks with hard-to-verify answers or complex multi-step reasoning, validating the effectiveness and generalizability of the hybrid reward paradigm.
📝 Abstract
Post-training for reasoning of large language models (LLMs) increasingly relies on verifiable rewards: deterministic checkers that provide 0/1 correctness signals. While reliable, such binary feedback is brittle: many tasks admit partially correct or alternative answers that verifiers under-credit, and the resulting all-or-nothing supervision limits learning. Reward models offer richer, continuous feedback, which can serve as a complementary supervisory signal to verifiers. We introduce HERO (Hybrid Ensemble Reward Optimization), a reinforcement learning framework that integrates verifier signals with reward-model scores in a structured way. HERO employs stratified normalization to bound reward-model scores within verifier-defined groups, preserving correctness while refining quality distinctions, and variance-aware weighting to emphasize challenging prompts where dense signals matter most. Across diverse mathematical reasoning benchmarks, HERO consistently outperforms RM-only and verifier-only baselines, with strong gains on both verifiable and hard-to-verify tasks. Our results show that hybrid reward design retains the stability of verifiers while leveraging the nuance of reward models to advance reasoning.
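To make the two mechanisms concrete, here is a minimal sketch of how stratified normalization and variance-aware weighting could be combined. This is an illustrative reading of the abstract, not the paper's implementation: the function name `hybrid_rewards`, the band width `alpha`, and the specific weighting rule are all assumptions.

```python
import numpy as np

def hybrid_rewards(verifier, rm_scores, alpha=0.3, eps=1e-8):
    """Illustrative hybrid reward (assumed form, not the paper's exact method).

    verifier:  0/1 correctness labels, one per sampled response to a prompt.
    rm_scores: continuous reward-model scores for the same responses.
    alpha:     width of the band in which normalized RM scores may vary.
    """
    v = np.asarray(verifier, dtype=float)
    s = np.asarray(rm_scores, dtype=float)
    r = v.copy()
    # Stratified normalization: rescale RM scores to [0, 1] *within* each
    # verifier-defined group, so an incorrect answer can never outrank a
    # correct one, while quality distinctions inside each group survive.
    for label in (1.0, 0.0):
        idx = v == label
        if idx.sum() > 1:
            g = s[idx]
            norm = (g - g.min()) / (g.max() - g.min() + eps)
            if label == 1.0:
                r[idx] = 1.0 + alpha * (norm - 1.0)  # correct -> [1 - alpha, 1]
            else:
                r[idx] = alpha * norm                # incorrect -> [0, alpha]
    # Variance-aware weighting (assumed rule): prompts with mixed
    # correct/incorrect samples have high verifier variance and are
    # the hardest, so their reward signal is upweighted.
    weight = 1.0 + v.var()
    return weight * r
```

With `verifier=[1, 1, 0, 0]` and `rm_scores=[0.9, 0.5, 0.8, 0.2]`, every correct response still scores above every incorrect one, but the reward model now separates the better answer within each group, and the fully mixed group raises the weight to 1.25.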