Peng's Q(λ) for Conservative Value Estimation in Offline Reinforcement Learning

📅 2026-05-14

🤖 AI Summary

Offline reinforcement learning suffers from value overestimation under distributional shift, and the conservative methods designed to correct it tend to become overly pessimistic and struggle to surpass the behavior policy. This work proposes CPQL, a model-free multi-step algorithm that replaces the standard Bellman operator in conservative Q-learning with Peng's Q(λ) operator, which the authors describe as the first such combination in offline RL. The multi-step operator implicitly regularizes the learned policy toward the behavior policy while making full use of offline trajectories. Theoretical analysis and empirical results indicate that CPQL mitigates over-pessimism while both matching or exceeding the behavior policy and retaining near-optimal performance guarantees. On the D4RL benchmark, CPQL significantly outperforms single-step baselines. Its pretrained Q-function also supports online fine-tuning that avoids the usual initial performance drop and yields robust gains.
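As an illustration of what "conservative" means here, the sketch below computes the regularizer commonly used in conservative Q-learning for discrete actions: the log-sum-exp of Q over all actions minus Q at the action stored in the dataset. This is a minimal sketch under assumptions; the function name `conservative_penalty`, the discrete-action setting, and the exact form of the penalty are illustrative and are not taken from the CPQL paper or its code.

```python
import numpy as np

def conservative_penalty(q_values, data_actions):
    """CQL-style penalty for discrete actions (hypothetical helper).

    q_values:     array of shape (batch, num_actions) with Q(s, a) estimates
    data_actions: array of shape (batch,) with the actions stored in the dataset
    """
    q_values = np.asarray(q_values, dtype=np.float64)
    data_actions = np.asarray(data_actions, dtype=np.int64)

    # Numerically stable log-sum-exp over the action dimension.
    row_max = q_values.max(axis=1, keepdims=True)
    lse = row_max[:, 0] + np.log(np.exp(q_values - row_max).sum(axis=1))

    # Q-value of the action that was actually taken in the dataset.
    q_data = q_values[np.arange(len(data_actions)), data_actions]

    # Pushes Q down on all actions and up on dataset actions.
    return float(np.mean(lse - q_data))
```

The penalty is never negative, because the log-sum-exp is at least as large as any single Q-value. In a full training loop it would be added, with some weight, to the regression loss toward the chosen backup target.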

📝 Abstract

We propose a model-free offline multi-step reinforcement learning (RL) algorithm, Conservative Peng's Q(λ) (CPQL). Our algorithm adapts the Peng's Q(λ) (PQL) operator for conservative value estimation as an alternative to the Bellman operator. To the best of our knowledge, this is the first work in offline RL to theoretically and empirically demonstrate the effectiveness of conservative value estimation with a multi-step operator by fully leveraging offline trajectories. The fixed point of the PQL operator in offline RL lies closer to the value function of the behavior policy, thereby naturally inducing implicit behavior regularization. CPQL simultaneously mitigates over-pessimistic value estimation, achieves performance greater than (or equal to) that of the behavior policy, and provides near-optimal performance guarantees, a milestone that previous conservative approaches could not achieve. Extensive numerical experiments on the D4RL benchmark demonstrate that CPQL consistently and significantly outperforms existing offline single-step baselines. Beyond offline RL, the proposed method also contributes to the offline-to-online learning framework: using the Q-function pre-trained by CPQL in offline settings enables the online PQL agent to avoid the performance drop typically observed at the start of fine-tuning and to attain robust performance improvements. Our code is available at https://github.com/oh-lab/CPQL.
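To make the multi-step operator concrete, the sketch below computes Peng's Q(λ) targets along one stored trajectory using the standard recursive form G_t = r_t + γ[(1 − λ) max_a Q(s_{t+1}, a) + λ G_{t+1}]. It is a minimal sketch, not the authors' implementation: the function name `peng_q_lambda_targets`, the array layout, and the default values of `gamma` and `lam` are assumptions made for illustration.

```python
import numpy as np

def peng_q_lambda_targets(rewards, next_max_q, dones, gamma=0.99, lam=0.7):
    """Peng's Q(lambda) targets for one trajectory (hypothetical helper).

    rewards:    r_t for t = 0..T-1
    next_max_q: max_a Q(s_{t+1}, a) for t = 0..T-1
    dones:      True where the episode terminated after step t
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    next_max_q = np.asarray(next_max_q, dtype=np.float64)
    dones = np.asarray(dones, dtype=bool)

    T = len(rewards)
    targets = np.zeros(T)

    # Past the last stored step there is no further return to mix in,
    # so the final step falls back to the one-step bootstrap.
    next_return = next_max_q[-1]

    # Sweep backwards so each target can reuse the one after it.
    for t in reversed(range(T)):
        if dones[t]:
            # Terminal transition: nothing to bootstrap from.
            targets[t] = rewards[t]
        else:
            mix = (1.0 - lam) * next_max_q[t] + lam * next_return
            targets[t] = rewards[t] + gamma * mix
        next_return = targets[t]
    return targets
```

Setting `lam=0` recovers the usual one-step Bellman optimality target, while `lam=1` gives the discounted return along the trajectory with a bootstrap only at the end. Intermediate values lean more heavily on the rewards actually observed in the dataset, which is consistent with the abstract's remark that the fixed point moves closer to the behavior policy's value function.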

Problem

Research questions and friction points this paper is trying to address.

Offline Reinforcement Learning

Conservative Value Estimation

Multi-step Operator

Behavior Policy

Value Over-pessimism

Innovation

Methods, ideas, or system contributions that make the work stand out.

Conservative Value Estimation

Multi-step Reinforcement Learning

Offline RL