🤖 AI Summary
This work identifies a pervasive “action collapse” (AC) phenomenon in policy gradient (PG) training of deep policy networks: for states sharing the same optimal action, activations in the action layer converge to their mean with vanishing variance, while the action-layer weights and the mean activations jointly collapse into a simplex equiangular tight frame (ETF). Building on this finding, the authors propose, *for the first time*, using a synthetically constructed ETF as a *fixed structural prior* for the action-selection layer to explicitly guide representation learning. The method introduces no additional parameters and no external supervision. Evaluated on multiple OpenAI Gym discrete-control benchmarks, it significantly accelerates PG convergence, improves final reward, and enhances training stability. Key contributions are: (i) the first systematic characterization of AC and its underlying ETF structure; and (ii) the novel application of neural collapse theory to design a structured inductive bias, yielding more robust and efficient policy optimization.
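As a concrete illustration of the structure the summary refers to, here is a minimal sketch of building a synthetic simplex ETF via the standard closed form M = sqrt(K/(K-1)) · U (I_K − (1/K)·11ᵀ) with a random partial-orthogonal U; the function name and the PyTorch framing are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the authors' code): construct a K-action simplex
# ETF in a feat_dim-dimensional feature space, assuming PyTorch.
import torch

def simplex_etf(num_actions: int, feat_dim: int) -> torch.Tensor:
    """Rows are unit vectors with pairwise cosine -1/(num_actions - 1)."""
    K = num_actions
    assert feat_dim >= K, "this simple construction needs feat_dim >= K"
    # Random partial-orthogonal U of shape (feat_dim, K), so U^T U = I_K.
    U, _ = torch.linalg.qr(torch.randn(feat_dim, K))
    # M = sqrt(K/(K-1)) * U (I_K - (1/K) 1 1^T); columns are the ETF vectors.
    M = (K / (K - 1)) ** 0.5 * U @ (torch.eye(K) - torch.ones(K, K) / K)
    return M.t()  # (K, feat_dim): one fixed weight vector per action
```

The returned rows have unit norm and pairwise cosine similarity −1/(K−1), i.e., the equal and maximal pairwise angles that characterize a simplex ETF: for `W = simplex_etf(4, 64)`, `W @ W.t()` has ones on the diagonal and −1/3 everywhere else.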
📝 Abstract
Policy gradient (PG) methods in reinforcement learning frequently utilize deep neural networks (DNNs) to learn a shared backbone of feature representations, which are used to compute action likelihoods in an action selection layer. Numerous studies have examined the convergence and global optima of policy networks, but few have analyzed the representational structure of the underlying networks. While training optimal policy DNNs, we observed that, under certain constraints, a structure resembling neural collapse emerges, which we refer to as Action Collapse (AC): 1) the state-action activations (i.e., last-layer features) sharing the same optimal action collapse towards that action's mean activation; 2) the within-action variability of those activations converges to zero; 3) the weights of the action selection layer and the mean activations collapse to a simplex equiangular tight frame (ETF). Our early work showed the aforementioned constraints to be necessary for these observations. Since the collapsed ETF of an optimal policy DNN maximizes the pairwise angular separation of all actions in the state-action space, we naturally raise a question: can we learn an optimal policy using an ETF structure as a (fixed) target configuration in the action selection layer? Our analytical proof shows that learning activations against a fixed ETF action selection layer naturally leads to AC. We thus propose the Action Collapse Policy Gradient (ACPG) method, which fixes a synthetic ETF as the action selection layer. ACPG induces the policy DNN to produce this ideal configuration in the action selection layer while remaining optimal. Our experiments across various OpenAI Gym environments demonstrate that our technique can be integrated into any discrete PG method and yields favorable reward improvements more quickly and robustly.
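To make the construction in the abstract concrete, below is a minimal, hedged sketch of a discrete policy network whose action selection layer is a frozen synthetic ETF, in the spirit of ACPG; the backbone architecture, the `ETFPolicy` name, and the PyTorch framing are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of an ETF-headed discrete policy, assuming PyTorch. Only the
# backbone is trainable, so the PG loss pulls features toward the fixed ETF.
import torch
import torch.nn as nn

class ETFPolicy(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, feat_dim: int = 64):
        super().__init__()
        assert feat_dim >= num_actions, "needs feat_dim >= num_actions"
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, feat_dim), nn.Tanh(),
            nn.Linear(feat_dim, feat_dim), nn.Tanh(),
        )
        # Synthetic simplex ETF, stored as a non-trainable buffer.
        K = num_actions
        U, _ = torch.linalg.qr(torch.randn(feat_dim, K))  # U^T U = I_K
        etf = ((K / (K - 1)) ** 0.5
               * U @ (torch.eye(K) - torch.ones(K, K) / K)).t()
        self.register_buffer("etf", etf)  # (K, feat_dim), frozen

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.backbone(obs) @ self.etf.t()  # action logits

# Drop-in with any discrete PG objective, e.g. REINFORCE-style:
# dist = torch.distributions.Categorical(logits=policy(obs))
# loss = -(dist.log_prob(actions) * returns).mean()
```

Because the head is a non-trainable buffer, the PG objective can only be improved by moving the backbone's features toward the fixed ETF directions, which is the collapsed configuration the abstract argues an optimal policy converges to anyway.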