🤖 AI Summary
To address the challenge of applying backpropagation-based reinforcement learning (RL) in resource-constrained settings or with non-differentiable neural networks, this paper proposes a gradient-free, noise-driven RL method. The approach approximates directional derivatives via stochastic neurons and couples reward prediction errors with eligibility traces to enable purely local, biologically plausible temporal credit assignment. It is the first work to integrate directional derivative theory into reward-modulated Hebbian learning (RMHL), eliminating reliance on global error signals and differentiability assumptions. Empirically, the method significantly outperforms conventional RMHL on standard RL benchmarks and is competitive with backpropagation-based baselines, while remaining compatible with neuromorphic hardware. This establishes a viable, energy-efficient learning paradigm for edge intelligence applications.
📝 Abstract
Recent advances in reinforcement learning (RL) have led to significant improvements in task performance. However, training neural networks in an RL regime typically relies on backpropagation, limiting applicability in resource-constrained environments or with non-differentiable neural networks. While noise-based alternatives like reward-modulated Hebbian learning (RMHL) have been proposed, their performance has remained limited, especially in scenarios with delayed rewards, which require retrospective credit assignment over time. Here, we derive a novel noise-based learning rule that addresses these challenges. Our approach combines directional derivative theory with Hebbian-like updates to enable efficient, gradient-free learning in RL. It features stochastic noisy neurons which can approximate gradients, and produces local synaptic updates modulated by a global reward signal. Drawing on concepts from neuroscience, our method uses reward prediction error as its optimization target to generate increasingly advantageous behavior, and incorporates an eligibility trace to facilitate temporal credit assignment in environments with delayed rewards. Its formulation relies on local information alone, making it compatible with implementations in neuromorphic hardware. Experimental validation shows that our approach significantly outperforms RMHL and is competitive with backpropagation-based baselines, highlighting the promise of noise-based, biologically inspired learning for low-power and real-time applications.
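To make the ingredients concrete, here is a minimal sketch of a generic noise-based learning rule of the kind the abstract describes: injected noise probes the reward landscape in place of a gradient, a decaying eligibility trace stores the local noise–input correlation, and a global reward prediction error gates the weight update. This is not the paper's algorithm; the task (a single linear neuron matching a hidden mapping from scalar reward alone), the hyperparameters, and all variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: a linear neuron receives x and should match a
# hidden target mapping w_true @ x. Only a scalar reward is observed --
# no gradients are ever computed.
n_in = 5
w_true = rng.normal(size=n_in)   # hidden target weights
w = np.zeros(n_in)               # learned weights

sigma = 0.1   # std of exploratory noise injected into the neuron
lam = 0.5     # eligibility-trace decay
eta = 0.02    # learning rate
alpha = 0.1   # update rate of the running reward baseline

r_bar = 0.0            # reward baseline (predicted reward)
e = np.zeros(n_in)     # eligibility trace

for step in range(10_000):
    x = rng.normal(size=n_in)
    xi = rng.normal(scale=sigma)        # perturbation of the neuron's output
    y = w @ x + xi                      # noisy activation
    r = -float((y - w_true @ x) ** 2)   # scalar reward only

    # Local, Hebbian-like eligibility: correlate the injected noise with
    # the presynaptic input, decaying over time.
    e = lam * e + xi * x

    # Global reward prediction error modulates the local trace.
    delta = r - r_bar
    r_bar += alpha * (r - r_bar)
    w += eta * delta * e

err = np.mean([(w @ x - w_true @ x) ** 2 for x in rng.normal(size=(500, n_in))])
print(err)
```

In expectation, the noise–reward correlation `delta * e` points along the reward gradient (the directional-derivative idea), so the neuron improves using only locally available quantities plus one global scalar; the trace `e` is what lets credit reach updates even when `delta` arrives late.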