Improving Policy Exploitation in Online Reinforcement Learning with Instant Retrospect Action

📅 2026-01-27
🏛️ Neural Networks
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the slow learning often observed in existing value-based online reinforcement learning algorithms, which stems from inefficient exploration and delayed policy updates. To overcome these limitations, the paper introduces three key techniques: Representation Difference Evolution (RDE) to enhance the discriminative power of state-action representations, Greedy Action Guidance (GAG) to improve the directionality of exploration, and Instant Policy Update (IPU) to eliminate policy lag. Additionally, the authors incorporate k-nearest neighbor action-value estimation and demonstrate that adopting a conservative policy during early training effectively mitigates value overestimation. Evaluated on eight MuJoCo continuous control tasks, the proposed method achieves substantial improvements in both sample efficiency and final performance.
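The k-nearest neighbor action-value estimation mentioned above can be sketched in general terms: store embeddings of visited state-action pairs together with their observed returns, and estimate Q(s, a) as the mean return of the k closest stored pairs. This is a minimal illustrative sketch of the generic technique, not the paper's exact formulation; all class, method, and parameter names here are assumptions.

```python
import numpy as np

class KNNActionValue:
    """Minimal k-NN action-value estimator: Q(s, a) is approximated by
    the mean return of the k stored state-action embeddings closest in
    Euclidean distance. Illustrative only; the paper's variant may use a
    learned representation and a different aggregation rule."""

    def __init__(self, k=5):
        self.k = k
        self.keys = []      # state-action embeddings seen so far
        self.returns = []   # observed return for each stored embedding

    def add(self, embedding, ret):
        # Record one (state-action embedding, return) pair.
        self.keys.append(np.asarray(embedding, dtype=float))
        self.returns.append(float(ret))

    def estimate(self, embedding):
        # Average the returns of the k nearest stored embeddings.
        if not self.keys:
            return 0.0  # no data yet: fall back to a neutral estimate
        keys = np.stack(self.keys)
        dists = np.linalg.norm(keys - np.asarray(embedding, dtype=float), axis=1)
        nearest = np.argsort(dists)[: self.k]
        return float(np.mean(np.asarray(self.returns)[nearest]))
```

For example, with k=2 and stored pairs at embeddings [0, 0] (return 1.0), [0.1, 0] (return 3.0), and [5, 5] (return -10.0), querying near [0.05, 0] averages the two nearby returns and yields 2.0.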

Problem

Research questions and friction points this paper is trying to address.

policy exploitation
online reinforcement learning
exploration inefficiency
delayed policy updates
value-based RL

Innovation

Methods, ideas, or system contributions that make the work stand out.

Instant Retrospect Action
Q-Representation Discrepancy Evolution
Greedy Action Guidance
Instant Policy Update
Policy Constraints
Gong Gao
School of Computer Science, Tongji University, China
Weidong Zhao
Shandong University
Xianhui Liu
School of Computer Science, Tongji University, China
Ning Jia
Tianjin University