Confidence-Controlled Exploration: Efficient Sparse-Reward Policy Learning for Robot Navigation

📅 2023-06-09
📈 Citations: 4
Influential: 0
🤖 AI Summary
To address low sample efficiency and insufficient exploration in robotic navigation under sparse rewards, this paper proposes an entropy-driven adaptive trajectory length control mechanism: trajectory truncation or extension is dynamically determined by policy entropy, enhancing gradient estimation accuracy and exploration stability without modifying the reward function. We establish, for the first time, a theoretical connection between policy entropy and gradient variance in both on-policy (PPO, REINFORCE) and off-policy (SAC) reinforcement learning frameworks. The method is validated in simulation and deployed end-to-end on a real Husky robot. Experimental results demonstrate an 18% improvement in task success rate over fixed-length and entropy-regularized baselines, a 20–38% reduction in path length, and a 9.32% decrease in climbing energy consumption.
📝 Abstract
Reinforcement learning (RL) is a promising approach for robotic navigation, allowing robots to learn through trial and error. However, real-world robotic tasks often suffer from sparse rewards, leading to inefficient exploration and suboptimal policies due to sample inefficiency of RL. In this work, we introduce Confidence-Controlled Exploration (CCE), a novel method that improves sample efficiency in RL-based robotic navigation without modifying the reward function. Unlike existing approaches, such as entropy regularization and reward shaping, which can introduce instability by altering rewards, CCE dynamically adjusts trajectory length based on policy entropy. Specifically, it shortens trajectories when uncertainty is high to enhance exploration and extends them when confidence is high to prioritize exploitation. CCE is a principled and practical solution inspired by a theoretical connection between policy entropy and gradient estimation. It integrates seamlessly with on-policy and off-policy RL methods and requires minimal modifications. We validate CCE across REINFORCE, PPO, and SAC in both simulated and real-world navigation tasks. CCE outperforms fixed-trajectory and entropy-regularized baselines, achieving an 18% higher success rate, 20-38% shorter paths, and 9.32% lower elevation costs under a fixed training sample budget. Finally, we deploy CCE on a Clearpath Husky robot, demonstrating its effectiveness in complex outdoor environments.
Problem

Research questions and friction points this paper is trying to address.

Sparse rewards in real-world navigation tasks lead to inefficient exploration and suboptimal policies
RL's sample inefficiency makes trial-and-error learning costly to run on physical robots
Existing remedies such as reward shaping and entropy regularization alter the reward signal and can destabilize training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Entropy-driven trajectory length control: rollouts are truncated when policy uncertainty is high and extended when confidence is high
Theoretical connection between policy entropy and gradient-estimation variance, covering REINFORCE, PPO, and SAC
Drop-in integration with on-policy and off-policy RL; no reward-function modification required
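The core mechanism above maps policy entropy to a rollout horizon: high entropy (low confidence) shortens trajectories to encourage exploration, while low entropy (high confidence) lengthens them for exploitation. The sketch below illustrates this idea under stated assumptions; the linear mapping and the `h_min`/`h_max` bounds are illustrative placeholders, not the paper's exact schedule.

```python
import math

def policy_entropy(probs):
    """Shannon entropy of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def adaptive_horizon(entropy, max_entropy, h_min=32, h_max=256):
    """Map policy entropy to a trajectory length.

    High entropy (uncertain policy) -> short rollouts, broader exploration.
    Low entropy (confident policy)  -> long rollouts, deeper exploitation.
    The linear interpolation and bounds are assumptions for illustration.
    """
    confidence = 1.0 - entropy / max_entropy
    return int(h_min + confidence * (h_max - h_min))

# A uniform 4-action policy is maximally uncertain -> shortest horizon.
uniform_h = adaptive_horizon(policy_entropy([0.25] * 4), math.log(4))
# A near-deterministic policy is confident -> horizon near the maximum.
peaked_h = adaptive_horizon(policy_entropy([0.97, 0.01, 0.01, 0.01]), math.log(4))
```

In a training loop, the horizon would be recomputed before each rollout from the current policy's mean entropy, so trajectory length tracks the policy's confidence as learning progresses.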