AI Summary
To address the low sample efficiency and training instability of reinforcement learning (RL) in industrial sorting, this paper proposes a GA-PPO-DQN hybrid framework. It employs a genetic algorithm (GA) to autonomously generate high-quality, transferable expert demonstration trajectories, which serve as prior knowledge to guide policy learning; integrates DQN's experience replay mechanism with Proximal Policy Optimization (PPO); and adopts demonstration-driven warm-start initialization to accelerate convergence. The key contribution is the first systematic application of GA for generating reusable, task-agnostic demonstration data, enabling principled integration of heuristic search and deep RL. Experiments demonstrate that the method significantly improves the cumulative reward and training stability of PPO agents. In the industrially inspired sorting environment, it outperforms baseline RL approaches by a substantial margin, validating both the effectiveness and generalizability of the hybrid paradigm.
Abstract
Reinforcement Learning (RL) has demonstrated significant potential in certain real-world industrial applications, yet its broader deployment remains limited by inherent challenges such as sample inefficiency and unstable learning dynamics. This study investigates the use of Genetic Algorithms (GAs) as a mechanism for improving RL performance in an industrially inspired sorting environment. We propose a novel approach in which GA-generated expert demonstrations are used to enhance policy learning. These demonstrations are incorporated into a Deep Q-Network (DQN) replay buffer for experience-based learning and used as warm-start trajectories for Proximal Policy Optimization (PPO) agents to accelerate training convergence. Our experiments compare standard RL training against rule-based heuristics, brute-force optimization, and demonstration-augmented learning, revealing that GA-derived demonstrations significantly improve RL performance. Notably, PPO agents initialized with GA-generated data achieved superior cumulative rewards, highlighting the potential of hybrid learning paradigms in which heuristic search methods complement data-driven RL. The framework is publicly available, enabling further research into adaptive RL strategies for real-world applications.
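The demonstration pipeline described above can be sketched end to end: a GA evolves fixed-length action sequences in a toy adjacent-swap sorting environment (a hypothetical stand-in for the paper's actual industrial task), and the best individual's trajectory is unpacked into (state, action, reward, next-state) transitions that seed a DQN-style replay buffer before any RL updates. The environment, reward shaping, and GA hyperparameters here are all illustrative assumptions, not the paper's configuration.

```python
import random
from collections import deque

random.seed(0)

ITEMS = [3, 1, 4, 2]        # toy list to sort (hypothetical stand-in for the sorting env)
SEQ_LEN = 8                 # actions per GA individual / episode
N_ACTIONS = len(ITEMS) - 1  # adjacent-swap positions

def inversions(xs):
    """Number of out-of-order pairs; 0 means fully sorted."""
    return sum(xs[i] > xs[j] for i in range(len(xs)) for j in range(i + 1, len(xs)))

def rollout(actions):
    """Execute a swap sequence; return transitions and total shaped reward."""
    state, transitions, total = list(ITEMS), [], 0.0
    for a in actions:
        nxt = list(state)
        nxt[a], nxt[a + 1] = nxt[a + 1], nxt[a]
        r = float(inversions(state) - inversions(nxt))  # +1 if the swap helps, -1 if it hurts
        transitions.append((tuple(state), a, r, tuple(nxt)))
        state, total = nxt, total + r
    return transitions, total

def ga_demonstrations(pop_size=30, gens=40, mut_rate=0.2, elite=5):
    """Evolve action sequences by fitness; return the best trajectory as demonstrations."""
    pop = [[random.randrange(N_ACTIONS) for _ in range(SEQ_LEN)] for _ in range(pop_size)]
    for _ in range(gens):
        scored = sorted(pop, key=lambda ind: rollout(ind)[1], reverse=True)
        pop = scored[:elite]                        # elitism: keep the best individuals
        while len(pop) < pop_size:
            p1, p2 = random.sample(scored[:10], 2)  # parents drawn from the top 10
            cut = random.randrange(1, SEQ_LEN)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            if random.random() < mut_rate:          # point mutation
                child[random.randrange(SEQ_LEN)] = random.randrange(N_ACTIONS)
            pop.append(child)
    best = max(pop, key=lambda ind: rollout(ind)[1])
    return rollout(best)

demos, best_reward = ga_demonstrations()
replay_buffer = deque(maxlen=10_000)  # DQN-style buffer, seeded before RL training begins
replay_buffer.extend(demos)
```

In the paper's setting, the same demonstration trajectories would additionally serve as warm-start data for the PPO agent; only the replay-buffer seeding side is shown here.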