🤖 AI Summary
To address the challenge in multi-lane autonomous driving where reinforcement learning struggles to simultaneously satisfy safety constraints and achieve high driving efficiency, this paper proposes a safety-oriented Harmonic Policy Iteration (HPI) framework. Methodologically, it introduces: (1) a harmonic gradient mechanism that dynamically fuses safety and efficiency gradients to enable conflict-minimizing policy updates; (2) the integration of Distributional Soft Actor-Critic (DSAC) with HPI, yielding the end-to-end algorithm DSAC-H; and (3) a comprehensive evaluation in a high-fidelity multi-lane simulation environment. Results demonstrate near-zero constraint-violation rates, improved training stability, and faster convergence than baseline methods including SAC and TD3. This work establishes a scalable, robust paradigm for constrained reinforcement learning in autonomous driving.
📝 Abstract
Reinforcement learning (RL), known for its self-evolution capability, offers a promising approach to training high-level autonomous driving systems. However, handling constraints remains a significant challenge for existing RL algorithms, particularly in real-world applications. In this paper, we propose a new safety-oriented training technique called harmonic policy iteration (HPI). At each RL iteration, it first calculates two policy gradients associated with efficient driving and safety constraints, respectively. Then, a harmonic gradient is derived to update the policy, minimizing conflicts between the two gradients and consequently enabling a more balanced and stable training process. Furthermore, we adopt the state-of-the-art distributional soft actor-critic (DSAC) algorithm as the backbone and integrate it with our HPI to develop a new safe RL algorithm, DSAC-H. Extensive simulations in multi-lane scenarios demonstrate that DSAC-H achieves efficient driving performance with near-zero safety constraint violations.
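The abstract describes fusing an efficiency gradient and a safety gradient into a single conflict-minimizing update direction. The paper does not spell out the fusion rule here, so the sketch below uses one plausible instantiation as an assumption: when the two gradients agree (non-negative inner product) they are averaged, and when they conflict the update is the minimum-norm point of their convex hull, which cancels the conflicting component so the result does not decrease either objective to first order. The function name `harmonic_gradient` and this specific rule are illustrative, not the authors' exact method.

```python
import numpy as np

def harmonic_gradient(g_eff: np.ndarray, g_safe: np.ndarray) -> np.ndarray:
    """Illustrative conflict-minimizing fusion of two policy gradients.

    If g_eff . g_safe >= 0 the gradients are compatible and we simply
    average them. Otherwise we pick the minimum-norm convex combination
    a * g_eff + (1 - a) * g_safe, a common two-objective fusion rule
    (assumed here; the actual HPI rule may differ).
    """
    if np.dot(g_eff, g_safe) >= 0.0:
        return 0.5 * (g_eff + g_safe)
    # Minimize ||a * g_eff + (1 - a) * g_safe||^2 over a in [0, 1]:
    # closed form a = (g_safe . (g_safe - g_eff)) / ||g_eff - g_safe||^2
    diff = g_eff - g_safe
    a = float(np.clip(np.dot(g_safe, g_safe - g_eff) / np.dot(diff, diff), 0.0, 1.0))
    return a * g_eff + (1.0 - a) * g_safe

# Fully opposed gradients cancel; compatible gradients are averaged.
print(harmonic_gradient(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # [0. 0.]
print(harmonic_gradient(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # [0.5 0.5]
```

The fused direction has a non-negative inner product with both input gradients, which is the "conflict-minimizing" property the abstract attributes to the harmonic gradient: a step along it does not worsen either the driving-efficiency objective or the safety-constraint objective locally.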