🤖 AI Summary
This work addresses the high variance, variance collapse, and underutilization of generated responses in existing group-based Reinforcement Learning with Verifiable Rewards (RLVR) methods, which stem from relying on only a few trajectories for reward point estimation. From a statistical estimation perspective, the authors reformulate RLVR by interpreting advantage computation as a finite-sample estimation of the policy-induced reward distribution. They propose the Discounted Beta-Bernoulli (DBB) reward estimator, which integrates historical rewards via a Bayesian Beta-Bernoulli model augmented with a discounting mechanism to handle non-stationary reward distributions. DBB reduces and stabilizes variance under controlled bias, avoids variance collapse, and achieves lower mean squared error than conventional point estimators. Integrated with GRPO, DBB improves Acc@8 by average margins of 3.22/2.42 percentage points on six in-distribution benchmarks and 12.49/6.92 points on three out-of-distribution benchmarks for the 1.7B and 8B models, respectively, without additional computational or memory overhead.
📝 Abstract
Reinforcement learning with verifiable rewards (RLVR) has emerged as an effective post-training paradigm for improving the reasoning capabilities of large language models. However, existing group-based RLVR methods often suffer from severe sample inefficiency. This inefficiency stems from reliance on point estimation of rewards from a small number of rollouts, leading to high estimation variance, variance collapse, and ineffective utilization of generated responses. In this work, we reformulate RLVR from a statistical estimation perspective by modeling rewards as samples drawn from a policy-induced distribution and casting advantage computation as the problem of estimating the reward distribution from finite data. Building on this view, we propose Discounted Beta-Bernoulli (DBB) reward estimation, which leverages historical reward statistics while accounting for the non-stationarity of the reward distribution. Although biased, the resulting estimator exhibits reduced and stable variance, theoretically avoids variance collapse of the estimate, and achieves lower mean squared error than standard point estimation. Extensive experiments across six in-distribution and three out-of-distribution reasoning benchmarks demonstrate that GRPO with DBB consistently outperforms naive GRPO, achieving average Acc@8 improvements of 3.22/2.42 points in-distribution and 12.49/6.92 points out-of-distribution on the 1.7B and 8B models, respectively, without additional computational cost or memory usage.
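To make the idea concrete, here is a minimal sketch of a discounted Beta-Bernoulli estimator for a single prompt's success probability. The class name, the prior pseudo-counts, and the discount value `gamma=0.9` are illustrative assumptions, not the paper's exact hyperparameters; the sketch only shows the general mechanism described in the abstract: exponentially discounted pseudo-counts of past binary rewards feeding a Beta posterior whose mean and standard deviation stay well-defined even when a group's rewards are all identical.

```python
class DiscountedBetaBernoulli:
    """Hypothetical sketch of a discounted Beta-Bernoulli reward estimator.

    alpha/beta are exponentially discounted pseudo-counts of past
    successes/failures; discounting down-weights stale evidence so the
    posterior can track the non-stationary, policy-induced reward
    distribution. Priors and gamma are illustrative choices.
    """

    def __init__(self, alpha0=1.0, beta0=1.0, gamma=0.9):
        self.alpha = alpha0   # prior pseudo-count of successes
        self.beta = beta0     # prior pseudo-count of failures
        self.gamma = gamma    # discount applied to historical evidence

    def update(self, rewards):
        # Decay old evidence, then fold in this step's binary rewards.
        successes = sum(rewards)
        failures = len(rewards) - successes
        self.alpha = self.gamma * self.alpha + successes
        self.beta = self.gamma * self.beta + failures

    def mean(self):
        # Posterior-mean estimate of the success probability.
        return self.alpha / (self.alpha + self.beta)

    def std(self):
        # Posterior std is strictly positive (alpha, beta > 0), so a
        # GRPO-style advantage normalizer built from it cannot collapse
        # to zero even when all rollouts in a group agree.
        s = self.alpha + self.beta
        return (self.alpha * self.beta / (s * s * (s + 1))) ** 0.5
```

Under this reading, an advantage for a rollout with reward `r` could be normalized as `(r - est.mean()) / est.std()`; unlike the per-group empirical standard deviation, `est.std()` never hits zero on an all-correct or all-wrong group, which is the variance-collapse failure mode the abstract refers to.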