SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

📅 2026-05-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses two problems in online reinforcement learning with prior offline data: the trade-off between high computational cost and long, multi-stage training pipelines, and the task-dependent manual tuning that fixed-length offline training phases require. It proposes SOPE, an algorithm that uses an actor-aligned off-policy evaluation (OPE) signal as an adaptive early-stopping mechanism. SOPE evaluates the critic on a held-out validation split under the current policy's action distribution and halts offline updates once that signal saturates, which removes the need for a hand-tuned schedule while limiting overfitting and making fuller use of the prior data. On 25 continuous-control tasks from the Minari benchmark, SOPE improves baseline performance by up to 45.6% and reduces compute by up to 22× in TFLOPs, balancing sample and computational efficiency.

📝 Abstract

Incorporating prior data into online reinforcement learning accelerates training but typically forces a difficult trade-off between high computational costs and long, multi-stage training pipelines. While fixed-length stabilization phases are significantly more computationally efficient than static update schedules, they require task-dependent manual tuning, risking either the waste of prior knowledge or severe overfitting. To address this, we propose SOPE, an algorithm that uses an actor-aligned Off-Policy Policy Evaluation (OPE) signal as an automated early-stopping mechanism to dynamically control the length of offline training phases. By evaluating the critic on a held-out validation split under the current policy's action distribution, SOPE halts gradient updates exactly when out-of-distribution benefits saturate, eliminating the need for manual schedule tuning. Evaluated on 25 continuous-control tasks from the Minari benchmark suite, SOPE improves baseline performance by up to 45.6% while reducing the required TFLOPs by up to 22×, thus balancing the trade-off between sample and computational efficiency. These findings demonstrate that adaptive, evaluation-driven update schedules are more effective than static, exhaustive update schedules.
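
The early-stopping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the exact OPE signal or the stopping rule, so the sketch assumes a mean squared TD error computed on held-out transitions (with the next action chosen by the current policy) and a simple patience-based plateau test. All names here (`PlateauStopper`, `actor_aligned_td_error`, `run_offline_phase`, `patience`, `min_delta`) are hypothetical.

```python
from typing import Callable, Sequence, Tuple

# (state, action, reward, next_state, done); states and actions are plain
# numbers here purely for illustration.
Transition = Tuple[float, float, float, float, bool]


class PlateauStopper:
    """Patience-based early stopping on a scalar signal (lower is better).

    Assumption: the paper's actual saturation criterion may differ.
    """

    def __init__(self, patience: int = 3, min_delta: float = 1e-3) -> None:
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale = 0

    def update(self, signal: float) -> bool:
        """Record a new signal value; return True when training should stop."""
        if signal < self.best - self.min_delta:
            self.best = signal
            self.stale = 0
        else:
            self.stale += 1
        return self.stale >= self.patience


def actor_aligned_td_error(
    batch: Sequence[Transition],
    q_fn: Callable[[float, float], float],
    policy_fn: Callable[[float], float],
    gamma: float = 0.99,
) -> float:
    """Mean squared TD error on held-out transitions.

    The bootstrap target uses the action the *current* policy would take in
    the next state, which is one way to read "actor-aligned" (an assumption).
    """
    total = 0.0
    for s, a, r, s_next, done in batch:
        bootstrap = 0.0 if done else gamma * q_fn(s_next, policy_fn(s_next))
        total += (q_fn(s, a) - (r + bootstrap)) ** 2
    return total / len(batch)


def run_offline_phase(signals: Sequence[float], stopper: PlateauStopper) -> int:
    """Return how many evaluations ran before the stopper fired."""
    for step, signal in enumerate(signals, start=1):
        if stopper.update(signal):
            return step
    return len(signals)
```

In a real training loop, each entry of `signals` would come from calling `actor_aligned_td_error` on the validation split after a block of offline gradient updates; the offline phase ends as soon as `update` returns `True`, in place of a fixed, hand-tuned number of updates.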

Problem

Research questions and friction points this paper is trying to address.

off-policy evaluation

online reinforcement learning

prior data

training stabilization

computational efficiency

Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-Policy Evaluation

Early Stopping

Adaptive Training Schedule