Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning

📅 2026-02-07
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses a key limitation in offline reinforcement learning: behavior cloning (BC) regularization often constrains policy performance by blindly imitating suboptimal actions, in particular suppressing exploration of high-value actions in later training stages. To mitigate this issue, the authors propose Proximal Action Replacement (PAR), a plug-and-play data augmentation mechanism within the actor-critic framework. PAR progressively replaces low-value actions in the training buffer with high-value actions generated by a stabilized policy. This reduces overreliance on suboptimal demonstrations, is compatible with various BC-regularized algorithms, and consistently yields significant performance gains across multiple offline RL benchmarks. Notably, when integrated with TD3+BC, PAR approaches state-of-the-art performance.
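For orientation, the BC regularization in question typically takes the TD3+BC form, in which the actor maximizes the critic's value estimate subject to a squared behavior-cloning penalty. The objective below is the standard TD3+BC formulation (Fujimoto & Gu, 2021), reproduced here as background rather than taken from this paper:

```latex
\pi \;\leftarrow\; \arg\max_{\pi}\;
\mathbb{E}_{(s,a)\sim\mathcal{D}}
\Big[\, \lambda\, Q\big(s,\pi(s)\big) \;-\; \big(\pi(s)-a\big)^{2} \Big],
\qquad
\lambda \;=\; \frac{\alpha}{\tfrac{1}{N}\sum_{i}\big|Q(s_i,a_i)\big|}
```

Because the squared term anchors π(s) to the logged action a, a suboptimal a limits how far the actor can move toward the critic's maximizer; this is the performance ceiling that PAR attacks by changing a itself.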

📝 Abstract
Offline reinforcement learning (RL) optimizes policies from a previously collected static dataset and is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which yields realistic policies and mitigates bias from out-of-distribution actions, but can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting high-value regions suggested by the critic, especially in later training when imitation is already dominant. We formally analyze this limitation by investigating the convergence properties of BC-regularized actor-critic optimization and verify it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), a plug-and-play training-sample replacement mechanism that progressively replaces low-value actions with high-value actions generated by a stable actor, broadening the explored action space while reducing the influence of low-value data. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance and approaches the state of the art when combined with basic TD3+BC.
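Reading only the abstract, the replacement step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names (par_replace, q_net, stable_actor), the critic-comparison criterion, and the margin parameter are all assumptions inferred from the phrase "progressively replaces low-value actions with high-value actions generated by a stable actor."

```python
import torch

@torch.no_grad()
def par_replace(states, actions, q_net, stable_actor, margin=0.0):
    """Hypothetical PAR step; names and the exact rule are assumptions.

    Swaps logged dataset actions for actions proposed by a stabilized
    actor (e.g., a delayed/EMA copy) wherever the critic estimates the
    proposal to be worth more than the logged action.
    """
    q_data = q_net(states, actions)      # critic value of logged actions
    proposed = stable_actor(states)      # candidate high-value actions
    q_prop = q_net(states, proposed)     # critic value of the candidates
    # Replace only where the candidate beats the logged action by `margin`;
    # annealing `margin` over training would make replacement progressive.
    better = q_prop > q_data + margin    # (B, 1) mask, broadcasts over actions
    return torch.where(better, proposed, actions)
```

The replaced batch would then feed the usual BC-regularized actor update, so the imitation target itself drifts toward higher-value actions as training proceeds; routing the same batch through any BC term is what would make the mechanism plug-and-play across regularization paradigms, consistent with the abstract's compatibility claim.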
Problem

Research questions and friction points this paper is trying to address.

offline reinforcement learning
behavior cloning
actor-critic
performance ceiling
suboptimal actions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proximal Action Replacement
Offline Reinforcement Learning
Behavior Cloning
Actor-Critic
Action Replacement
👥 Authors
Jinzong Dong
School of Automation, Central South University, Changsha, China; Shanghai AI Laboratory, Shanghai, China
Wei Huang
Shanghai AI Laboratory, Shanghai, China
Jianshu Zhang
Shanghai AI Laboratory, Shanghai, China; Shanghai Jiao Tong University, Shanghai, China
Zhuo Chen
Shanghai AI Laboratory, Shanghai, China; Shanghai Jiao Tong University, Shanghai, China
Xinzhe Yuan
Shanghai AI Laboratory, Shanghai, China
Qinying Gu
Shanghai AI Laboratory, Shanghai, China
Zhaohui Jiang
School of Automation, Central South University, Changsha, China; Shanghai AI Laboratory, Shanghai, China
Nanyang Ye
Shanghai Jiao Tong University
Research areas: Out-of-Distribution Generalization, Embodied AI, Unmanned Aerial Vehicle, HDR Imaging