When Learning Rates Go Wrong: Early Structural Signals in PPO Actor-Critic

📅 2026-03-10
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the high sensitivity of Proximal Policy Optimization (PPO) training to learning rate selection and the absence of effective early indicators for evaluating hyperparameter configurations. The authors propose an early stopping criterion based on the Overfitting-Underfitting Indicator (OUI), which analyzes the dynamic sign patterns of hidden neuron activations in the Actor-Critic networks. This method enables high-accuracy identification of suitable learning rates using only the first 10% of training progress. Theoretical analysis reveals a strong connection between learning rates and neuronal activation dynamics, further uncovering distinct OUI characteristics between the Actor and Critic at optimal performance. Extensive experiments across multiple environments and random seeds demonstrate that OUI—used alone or in conjunction with early return signals—significantly outperforms existing early stopping baselines, enabling efficient pruning of low-potential training runs.
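The paper defines OUI over the binary (sign) activation patterns of hidden neurons on a fixed probe batch; its exact formula is not reproduced here. Below is a minimal sketch of one plausible batch-based reading, assuming OUI is the mean per-neuron balance between active and inactive states over the probe batch. The function name and the 4p(1-p) balance score are illustrative assumptions, not the authors' definition.

```python
import numpy as np

def oui(pre_activations: np.ndarray) -> float:
    """Batch-based Overfitting-Underfitting Indicator (illustrative sketch).

    pre_activations: array of shape (batch_size, num_hidden_neurons) holding
    the hidden-layer pre-activations of the actor or critic network evaluated
    on a fixed probe batch.

    For each neuron, p is the fraction of probe samples on which it is active
    (pre-activation > 0). The score 4 * p * (1 - p) equals 1 when the neuron
    fires on half the batch and 0 when it is always on or always off
    (saturated). OUI here is the mean of this balance score over all neurons.
    """
    p_active = (pre_activations > 0).mean(axis=0)    # per-neuron firing rate
    balance = 4.0 * p_active * (1.0 - p_active)      # 0 = saturated, 1 = balanced
    return float(balance.mean())

# Toy usage: random pre-activations for a probe batch of 256 states, 64 neurons.
rng = np.random.default_rng(0)
print(oui(rng.normal(size=(256, 64))))
```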

📝 Abstract
Deep Reinforcement Learning systems are highly sensitive to the learning rate (LR), and selecting stable and performant training runs often requires extensive hyperparameter search. In Proximal Policy Optimization (PPO) actor-critic methods, small LR values lead to slow convergence, whereas large LR values may induce instability or collapse. We analyse this phenomenon through the behavior of the hidden neurons in the network using the Overfitting-Underfitting Indicator (OUI), a metric that quantifies the balance of binary activation patterns over a fixed probe batch. We introduce an efficient batch-based formulation of OUI and derive a theoretical connection between the LR and activation sign changes, clarifying how a correct evolution of the neurons' inner structure depends on the step size. Empirically, across three discrete-control environments and multiple seeds, we show that OUI measured at only 10% of training already discriminates between LR regimes. We observe a consistent asymmetry: critic networks achieving the highest return operate in an intermediate OUI band (avoiding saturation), whereas actor networks achieving the highest return exhibit comparatively high OUI values. We then compare OUI-based screening rules against early-return, clip-based, divergence-based, and flip-based criteria under matched recall over successful runs. In this setting, OUI provides the strongest early screening signal: OUI alone achieves the best precision at broader recall, while combining early return with OUI yields the highest precision in the best-performing screening regimes, enabling aggressive pruning of unpromising runs without requiring full training.
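As a usage illustration of the screening rules described above, here is a hedged sketch of how an OUI-based rule might prune runs at the 10% checkpoint. The numeric bands, the early-return threshold, and the combination logic are placeholders chosen for illustration; the abstract reports the qualitative asymmetry (intermediate critic OUI band, comparatively high actor OUI) but not these values.

```python
def keep_run(actor_oui: float, critic_oui: float, early_return: float,
             actor_band=(0.7, 1.0), critic_band=(0.3, 0.7),
             return_threshold=0.0) -> bool:
    """Screen a PPO run at ~10% of training (illustrative sketch).

    Reflects the asymmetry described in the abstract: best-return critics sit
    in an intermediate OUI band (avoiding saturation), while best-return
    actors show comparatively high OUI. All thresholds here are placeholder
    assumptions, not values reported by the authors.
    """
    actor_ok = actor_band[0] <= actor_oui <= actor_band[1]
    critic_ok = critic_band[0] <= critic_oui <= critic_band[1]
    return_ok = early_return >= return_threshold
    # "OUI alone" would drop the return check; "early return + OUI" keeps it.
    return actor_ok and critic_ok and return_ok

# Toy usage: a run whose actor OUI is high and critic OUI is intermediate.
print(keep_run(actor_oui=0.85, critic_oui=0.55, early_return=12.0))  # True
```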
Problem

Research questions and friction points this paper is trying to address.

learning rate
PPO
early signal
actor-critic
hyperparameter sensitivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

learning rate sensitivity
Overfitting-Underfitting Indicator (OUI)
early stopping criterion
activation pattern analysis
PPO actor-critic
🔎 Similar Papers
No similar papers found.
Alberto Fernández-Hernández
Universitat Politècnica de València
Cristian Pérez-Corral
Universitat Politècnica de València
Jose I. Mestre
Universitat Politècnica de València
Manuel F. Dolz
Universitat Jaume I
High Performance Computing, Energy Efficiency, Parallel Programming Models, Performance Analysis, Deep Learning
Jose Duato
Universitat Politècnica de València
Interconnection Networks, Multiprocessors
Enrique S. Quintana-Ortí
Universitat Politècnica de València, Spain