An Optimal Discriminator Weighted Imitation Perspective for Reinforcement Learning

📅 2025-04-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing dual reinforcement learning (Dual-RL) methods struggle to accurately estimate the optimal state visitation distribution ratio from offline data without expert demonstrations, which severely limits policy optimization. To address this, we propose Iterative Dual Reinforcement Learning (IDRL), which proves the equivalence between the optimal discriminator weights and the true state visitation distribution ratio. Leveraging this theoretical insight, IDRL introduces an iterative pruning mechanism that progressively reconstructs sub-datasets within a behavior-cloning Dual-RL framework, thereby approximating the theoretically optimal ratio and inducing an interpretable, curriculum-style distribution optimization. Empirically, IDRL consistently outperforms both Primal-RL and Dual-RL baselines across D4RL benchmarks and noisy demonstration datasets, achieving superior performance and markedly improved training stability.

📝 Abstract
We introduce Iterative Dual Reinforcement Learning (IDRL), a new method that takes an optimal discriminator-weighted imitation view of solving RL. Our method is motivated by a simple experiment: training a discriminator on the offline dataset plus an additional expert dataset and then performing discriminator-weighted behavior cloning gives strong results on various types of datasets. This optimal discriminator weight is quite similar to the learned visitation distribution ratio in Dual-RL; however, we find that current Dual-RL methods do not correctly estimate that ratio. In IDRL, we propose a correction method that iteratively approaches the optimal visitation distribution ratio in the offline dataset given no additional expert dataset. During each iteration, IDRL removes zero-weight suboptimal transitions using the learned ratio from the previous iteration and runs Dual-RL on the remaining subdataset. This can be seen as replacing the behavior visitation distribution with the optimized visitation distribution from the previous iteration, which theoretically yields a curriculum of improved visitation distribution ratios that are closer to the optimal discriminator weight. We verify the effectiveness of IDRL on various kinds of offline datasets, including D4RL datasets and more realistic corrupted demonstrations. IDRL beats strong Primal-RL and Dual-RL baselines in both performance and stability on all datasets.
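The motivating experiment described above (train a discriminator to tell expert from offline data, then weight behavior cloning by the optimal-discriminator ratio D/(1-D)) can be sketched as a toy. This is an illustrative sketch under assumptions, not the paper's implementation: the 1-D Gaussian data, the logistic discriminator, and the variable names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D features: expert transitions cluster near +1; the offline
# dataset is a mix of expert-like and suboptimal transitions.
expert = rng.normal(1.0, 0.3, size=(200, 1))
offline = np.concatenate([rng.normal(1.0, 0.3, size=(100, 1)),
                          rng.normal(-1.0, 0.3, size=(100, 1))])

# Logistic discriminator D(x), trained to output 1 on expert, 0 on offline.
X = np.concatenate([expert, offline])
y = np.concatenate([np.ones(len(expert)), np.zeros(len(offline))])
w, b = np.zeros(1), 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # sigmoid probabilities
    grad = p - y                            # gradient of logistic loss
    w -= 0.1 * (X.T @ grad) / len(X)
    b -= 0.1 * grad.mean()

# The optimal-discriminator weight D/(1-D) approximates the expert/offline
# visitation ratio; use it to weight a behavior-cloning loss per transition.
d = 1.0 / (1.0 + np.exp(-(offline @ w + b)))
bc_weights = d / (1.0 - d)

# Expert-like offline transitions get high weight; the rest get near zero.
print(bc_weights[:100].mean() > bc_weights[100:].mean())  # True
```

In a weighted-BC objective these per-transition weights would simply multiply the log-likelihood of each offline action, so suboptimal transitions contribute almost nothing to the gradient.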
Problem

Research questions and friction points this paper is trying to address.

Dual-RL methods misestimate the optimal visitation distribution ratio from offline data
No expert demonstrations are available to guide discriminator-weighted imitation
Offline RL performance and training stability degrade on diverse, noisy datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses optimal discriminator-weighted imitation learning
Iteratively corrects visitation distribution ratio
Removes suboptimal transitions via learned ratio
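The innovation bullets above compose into a simple loop: estimate the visitation ratio, drop zero-weight transitions, and rerun on the surviving subdataset. A minimal sketch of that loop, where the hypothetical `estimate_ratio` is a stand-in scoring function rather than the paper's learned Dual-RL ratio:

```python
import numpy as np

def estimate_ratio(dataset):
    """Stand-in for the learned visitation-ratio estimate (hypothetical).
    Here: a clipped z-score of toy returns, so below-average transitions
    receive weight zero, mimicking a zero-weight suboptimal transition."""
    r = dataset["returns"]
    return np.clip((r - r.mean()) / (r.std() + 1e-8), 0.0, None)

def idrl_prune(dataset, n_iters=3):
    """Iteratively remove zero-weight transitions and re-estimate on the
    remaining subdataset, yielding a curriculum of shrinking datasets."""
    for _ in range(n_iters):
        ratio = estimate_ratio(dataset)
        keep = ratio > 0.0
        if keep.sum() in (0, len(ratio)):  # nothing left to prune
            break
        dataset = {k: v[keep] for k, v in dataset.items()}
    return dataset

rng = np.random.default_rng(1)
data = {"obs": rng.normal(size=(1000, 4)), "returns": rng.normal(size=1000)}
pruned = idrl_prune(data)
print(len(pruned["returns"]) < len(data["returns"]))  # True
```

In IDRL proper, the ratio at each iteration comes from running Dual-RL on the current subdataset, so each pass replaces the behavior visitation distribution with the previous iteration's optimized one.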
👥 Authors
Haoran Xu, University of Texas at Austin
Shuozhe Li, University of Texas at Austin
Harshit S. Sikchi, University of Texas at Austin
S. Niekum, UMass Amherst
Amy Zhang, Meta AI

Topics: Reinforcement Learning, Robot Learning, Computer Network