Improving Policy Exploitation in Online Reinforcement Learning with Instant Retrospect Action

📅 2026-01-27
🏛️ Neural Networks
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the slow learning often observed in existing value-based online reinforcement learning algorithms, which stems from inefficient exploration and delayed policy updates. To overcome these limitations, the paper introduces three key techniques: Representation Difference Evolution (RDE) to enhance the discriminative power of state-action representations, Greedy Action Guidance (GAG) to improve the directionality of exploration, and Instant Policy Update (IPU) to eliminate policy lag. Additionally, the authors incorporate k-nearest neighbor action-value estimation and demonstrate that adopting a conservative policy during early training effectively mitigates value overestimation. Evaluated on eight MuJoCo continuous control tasks, the proposed method achieves substantial improvements in both sample efficiency and final performance.
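The k-nearest neighbor action-value estimation mentioned above can be sketched in general terms: store embeddings of visited state-action pairs together with their observed returns, and estimate Q(s, a) as the mean return of the k closest stored pairs. This is a minimal illustrative sketch of the generic technique, not the paper's exact formulation; all class, method, and parameter names here are assumptions.

```python
import numpy as np

class KNNActionValue:
    """Minimal k-NN action-value estimator: Q(s, a) is approximated by
    the mean return of the k stored state-action embeddings closest in
    Euclidean distance. Illustrative only; the paper's variant may use a
    learned representation and a different aggregation rule."""

    def __init__(self, k=5):
        self.k = k
        self.keys = []      # state-action embeddings seen so far
        self.returns = []   # observed return for each stored embedding

    def add(self, embedding, ret):
        # Record one (state-action embedding, return) pair.
        self.keys.append(np.asarray(embedding, dtype=float))
        self.returns.append(float(ret))

    def estimate(self, embedding):
        # Average the returns of the k nearest stored embeddings.
        if not self.keys:
            return 0.0  # no data yet: fall back to a neutral estimate
        keys = np.stack(self.keys)
        dists = np.linalg.norm(keys - np.asarray(embedding, dtype=float), axis=1)
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean(np.asarray(self.returns)[nearest]))
```

For example, with k=2 and stored pairs at embeddings [0, 0] (return 1.0), [0.1, 0] (return 3.0), and [5, 5] (return -10.0), querying near [0.05, 0] averages the two nearby returns and yields 2.0.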

Problem

Research questions and friction points this paper is trying to address.

policy exploitation
online reinforcement learning
exploration inefficiency
delayed policy updates
value-based RL

Innovation

Methods, ideas, or system contributions that make the work stand out.

Instant Retrospect Action
Q-Representation Discrepancy Evolution
Greedy Action Guidance
Instant Policy Update
Policy Constraints
Gong Gao
School of Computer Science, Tongji University, China
Weidong Zhao
Shandong University
Xianhui Liu
School of Computer Science, Tongji University, China
Ning Jia
Tianjin University