General Flexible $f$-divergence for Challenging Offline RL Datasets with Low Stochasticity and Diverse Behavior Policies

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of offline reinforcement learning under low-exploration, multi-policy datasets, where accurate value estimation and the trade-off between policy improvement and data constraints are difficult to achieve. The authors propose a novel approach that, for the first time, integrates a general and flexible $f$-divergence framework with Bellman residual constraints. By leveraging convex conjugates and linear programming, the method constructs an adaptive objective that dynamically modulates the strength of constraints imposed by complex offline data distributions. Evaluated on standard benchmarks including MuJoCo, Fetch, and AdroitHand, the approach demonstrates significant improvements in policy performance and robustly handles challenging offline datasets.

📝 Abstract
Offline RL algorithms aim to improve upon the behavior policy that produced the collected data while constraining the learned policy to stay within the support of the dataset. However, practical offline datasets often contain examples with little diversity or limited exploration of the environment, gathered from multiple behavior policies with diverse expertise levels. Limited exploration can impair an offline RL algorithm's ability to estimate *Q* or *V* values, while constraining towards diverse behavior policies can be overly conservative. Such datasets call for a balance between the RL objective and behavior policy constraints. We first identify the connection between $f$-divergence and the optimization constraint on the Bellman residual through a more general Linear Programming form for RL and the convex conjugate. Following this, we introduce a general flexible function formulation for the $f$-divergence that imposes an adaptive constraint on the algorithm's learning objective based on the offline training dataset. Results from experiments on the MuJoCo, Fetch, and AdroitHand environments show the correctness of the proposed LP form and the potential of the flexible $f$-divergence to improve performance when learning from a challenging dataset with a compatible constrained optimization algorithm.
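The abstract's central object, the $f$-divergence between the learned policy and the behavior policy, can be illustrated with a minimal sketch. The snippet below computes $D_f(P \| Q) = \sum_x q(x)\, f\!\big(p(x)/q(x)\big)$ for two standard generator choices (KL and chi-square) on toy discrete distributions; the policies, distribution values, and function names are illustrative assumptions, not taken from the paper, and the paper's actual method (the adaptive flexible $f$ and the Bellman-residual constraint) is not reproduced here.

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P || Q) = sum_x q(x) * f(p(x)/q(x)) for discrete distributions.

    Different generators f recover different divergences, which is what
    lets an f-divergence framework modulate constraint strength."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f(p / q)))

# Two classic generators: f(1) = 0 for both, so identical policies incur no penalty.
f_kl = lambda t: t * np.log(t)       # recovers KL(P || Q)
f_chi2 = lambda t: (t - 1.0) ** 2    # recovers the chi-square divergence

# Hypothetical learned policy pi and behavior policy beta over 3 actions.
pi = np.array([0.7, 0.2, 0.1])
beta = np.array([0.4, 0.4, 0.2])

kl = f_divergence(pi, beta, f_kl)      # ≈ 0.184
chi2 = f_divergence(pi, beta, f_chi2)  # = 0.375
```

Because $f(1) = 0$ and $f$ is convex, the penalty vanishes when the learned policy matches the behavior policy and grows as they diverge; choosing (or, as in this work, adapting) $f$ changes how sharply deviations from the data distribution are punished.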
Problem

Research questions and friction points this paper is trying to address.

Offline Reinforcement Learning
f-divergence
Behavior Policy Constraint
Low Stochasticity
Diverse Behavior Policies
Innovation

Methods, ideas, or system contributions that make the work stand out.

flexible f-divergence
offline reinforcement learning
behavior policy constraint
linear programming formulation
adaptive regularization