🤖 AI Summary
To address insufficient policy diversity and limited exploration in complex robotic manipulation tasks under reinforcement learning, this paper proposes a trajectory-first curriculum learning paradigm. First, it models the high-dimensional behavioral trajectory distribution to enable global exploration; then, it transfers trajectory-level knowledge to step-wise policies via hierarchical policy distillation. This approach overcomes the local exploration bottleneck imposed by conventional skill-space constraints, integrating constrained diversity optimization, trajectory clustering, and hybrid training—comprising behavior cloning initialization followed by PPO fine-tuning. Evaluated on multi-stage robotic manipulation tasks, the method achieves a 42% improvement in policy diversity (measured by Jensen–Shannon divergence) over skill-based baselines, while demonstrating superior robustness and cross-task generalization capability.
📝 Abstract
Being able to solve a task in diverse ways makes agents more robust to task variations and less prone to local optima. In this context, constrained diversity optimization has emerged as a powerful reinforcement learning (RL) framework to train a diverse set of agents in parallel. However, existing constrained-diversity RL methods often under-explore in complex tasks such as robotic manipulation, leading to a lack in policy diversity. To improve diversity optimization in RL, we therefore propose a curriculum that first explores at the trajectory level before learning step-based policies. In our empirical evaluation, we provide novel insights into the shortcoming of skill-based diversity optimization, and demonstrate empirically that our curriculum improves the diversity of the learned skills.