🤖 AI Summary
This work addresses the instability of gradient-free optimization of recurrent spiking neural networks (RSNNs) in high-dimensional, long-horizon reinforcement learning, which often stems from high estimator variance. To mitigate this, the authors propose the Signal-Adaptive Trust Region (SATR) method, which introduces, for the first time, a KL-divergence constraint normalized by signal energy to guide RSNN policy updates. The trust region dynamically expands under strong signal conditions and contracts when noise dominates, improving optimization stability, particularly with small population sizes. By combining Bernoulli connectivity distributions with bitset acceleration, SATR significantly outperforms existing gradient-free approaches across multiple high-dimensional continuous-control tasks, matching the training stability of PPO-LSTM while substantially reducing wall-clock training time.
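The adaptive constraint described above can be illustrated with a minimal sketch. The function names, the signal/noise normalization, and the backtracking line search below are illustrative assumptions, not the paper's actual implementation; the closed-form Bernoulli KL is standard.

```python
import numpy as np

def bernoulli_kl(p, q, eps=1e-8):
    # Closed-form KL divergence between elementwise Bernoulli(p) and Bernoulli(q).
    p = np.clip(p, eps, 1 - eps)
    q = np.clip(q, eps, 1 - eps)
    return np.sum(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))

def satr_step(p, grad_est, grad_var, lr=0.1, delta0=0.01):
    """One signal-adaptive update of Bernoulli connection probabilities p.

    grad_est: population-based gradient estimate (assumed given).
    grad_var: per-parameter variance of that estimate (assumed given).
    """
    signal = np.sum(grad_est ** 2)          # estimated signal energy
    noise = np.sum(grad_var) + 1e-8         # estimated noise energy
    # KL budget expands when signal dominates, contracts when noise dominates.
    budget = delta0 * signal / (signal + noise)
    q = np.clip(p + lr * grad_est, 1e-3, 1 - 1e-3)
    # Backtrack the step size until the update fits inside the trust region.
    while bernoulli_kl(q, p) > budget and lr > 1e-6:
        lr *= 0.5
        q = np.clip(p + lr * grad_est, 1e-3, 1 - 1e-3)
    return q
```

With this normalization, a noise-dominated gradient estimate receives a smaller KL budget and therefore a shorter step, which is the stabilizing behavior the summary describes.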
📝 Abstract
Recurrent spiking neural networks (RSNNs) are a promising substrate for energy-efficient control policies, but training them for high-dimensional, long-horizon reinforcement learning remains challenging. Population-based, gradient-free optimization circumvents backpropagation through non-differentiable spike dynamics by estimating gradients. However, with finite populations, the high variance of these estimates can induce harmful, overly aggressive update steps. Inspired by trust-region methods in reinforcement learning that constrain policy updates in distribution space, we propose **Signal-Adaptive Trust Regions (SATR)**, a distributional update rule that constrains relative change by bounding KL divergence normalized by an estimated signal energy. SATR automatically expands the trust region under strong signals and contracts it when updates are noise-dominated. We instantiate SATR for Bernoulli connectivity distributions, which have shown strong empirical performance for RSNN optimization. Across a suite of high-dimensional continuous-control benchmarks, SATR improves stability under limited populations and reaches returns competitive with strong baselines, including PPO-LSTM. In addition, to make SATR practical at scale, we introduce a bitset implementation for binary spiking and binary weights, substantially reducing wall-clock training time and enabling fast RSNN policy search.