Reinforcement Learning with Conditional Expectation Reward

📅 2026-03-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Traditional rule-based verifiers struggle to generalize to free-form answers in open-ended reasoning tasks, limiting the applicability of reinforcement learning with verifiable rewards (RLVR). This work proposes Conditional Expectation Reward (CER), a novel approach that leverages the intrinsic capabilities of large language models to construct a general, differentiable soft reward mechanism. CER computes the expected likelihood of the reference answer conditioned on the model-generated response, eliminating the need for external or handcrafted verification rules. Evaluated on mathematical and general reasoning benchmarks, CER delivers substantial performance improvements, establishing it as a flexible and broadly applicable verification mechanism for RLVR frameworks.

📝 Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has proven effective in enhancing the reasoning capabilities of large language models, particularly in domains such as mathematics where reliable rule-based verifiers can be constructed. However, the reliance on handcrafted, domain-specific verification rules substantially limits the applicability of RLVR to general reasoning domains with free-form answers, where valid answers often exhibit significant variability, making it difficult to establish complete and accurate rules. To address this limitation, we propose Conditional Expectation Reward (CER), which leverages the large language model itself as an implicit verifier, and is therefore applicable to general domains and eliminates the need for external verifiers or auxiliary models. CER is defined as the expected likelihood of generating the reference answer conditioned on the generated answer. In contrast to rule-based verifiers that yield binary feedback, CER provides a soft, graded reward signal that reflects varying degrees of correctness, making it better suited to tasks where answers vary in correctness. Experimental results demonstrate that CER is effective across a wide range of reasoning tasks, spanning both mathematical and general domains, indicating that CER serves as a flexible and general verification mechanism. The code is available at https://github.com/changyi7231/CER.
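The reward described in the abstract — the likelihood of the reference answer conditioned on the generated response, scored by the model itself — can be illustrated with a minimal sketch. The toy distribution below stands in for the policy LLM's next-token probabilities (`TOY_LM`, the prompt format, and the geometric-mean aggregation over reference tokens are illustrative assumptions, not the paper's exact formulation):

```python
import math

# Toy stand-in for the policy LLM's conditional token distribution:
# maps a prompt+response context to next-token probabilities. In the
# paper's setting these probabilities come from the model itself, so
# no external verifier or auxiliary model is needed.
TOY_LM = {
    "Q: 2+2= A: 4 Ref:": {"4": 0.9, "5": 0.02},
    "Q: 2+2= A: 5 Ref:": {"4": 0.1, "5": 0.7},
}

def cer_reward(question, response, reference_tokens):
    """CER-style soft reward: likelihood of the reference answer
    conditioned on the generated response (geometric mean over
    reference tokens; a sketch, not the paper's exact estimator)."""
    context = f"{question} A: {response} Ref:"
    dist = TOY_LM.get(context, {})
    # Tiny probability floor for tokens the toy model has never seen.
    logps = [math.log(dist.get(t, 1e-9)) for t in reference_tokens]
    return math.exp(sum(logps) / len(logps))

# A correct response makes the reference answer highly likely (~0.9),
# while a wrong response yields a graded, low reward (~0.1) rather
# than the hard zero a binary rule-based verifier would assign.
r_good = cer_reward("Q: 2+2=", "4", ["4"])
r_bad = cer_reward("Q: 2+2=", "5", ["4"])
```

The graded signal (`r_good > r_bad > 0`) is what distinguishes CER from binary rule-based feedback: partially correct answers still receive informative reward.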
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning
Verifiable Rewards
Conditional Expectation Reward
General Reasoning
Free-form Answers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Conditional Expectation Reward
Reinforcement Learning
Large Language Models
Soft Reward Signal
Implicit Verification