🤖 AI Summary
This work addresses a limitation of conventional reinforcement learning for legged robots: controllers are trained offline with global backpropagation and deployed as fixed policies, which hinders online adaptation under onboard power constraints. The authors propose a proximal policy optimization (PPO) framework based on Equilibrium Propagation (EP), a local learning rule typically used in energy-based models, and apply it to high-dimensional continuous control. The controller combines a central pattern generator (CPG) policy with a residual postural adjustment policy. Training relies on an EP-compatible output-nudging signal and a two-sided ratio clipping mechanism, replacing backpropagation with local learning rules that are more compatible with neuromorphic and in-memory computing hardware. Experiments on a 12-DoF A1 quadruped show stable convergence on uneven terrain, with success rate, velocity tracking, actuator power, and body stability comparable to a backpropagation-trained PPO baseline, and a 4.3× improvement in GPU memory efficiency over backpropagation through time (BPTT).
📝 Abstract
Reinforcement learning (RL) has enabled robust quadruped locomotion over complex terrain, but most learned controllers are trained offline with backpropagation in massively parallel simulation and deployed as fixed policies, limiting adaptation to terrain variation, payload changes, actuator wear, and other real-world conditions under onboard power constraints. Local learning provides a potential path toward energy-aware on-robot adaptation by replacing global backpropagation graphs with updates driven by local neural states, making the learning rule more compatible with neuromorphic and in-memory computing substrates. This work proposes an equilibrium-propagation (EP)-based proximal policy optimization (PPO) framework for uneven-terrain quadruped locomotion. The controller combines a bio-inspired central pattern generator (CPG) policy with a residual postural adjustment policy, while replacing conventional backpropagation-trained policy and value networks with EP-enabled local learning. To train stochastic continuous-control policies with EP, we derive an EP-compatible PPO output-nudging signal and introduce a two-sided ratio clipping mechanism that stabilizes policy updates during relaxation. Experiments on a 12-DoF A1 quadruped show that the proposed controller achieves stable policy convergence in a two-stage uneven terrain locomotion task. Its locomotion performance is comparable to a backpropagation-trained PPO baseline in success rate, velocity tracking, actuator power, and body stability, while improving GPU memory efficiency by 4.3\(\times\) compared with backpropagation through time (BPTT). These results suggest that local equilibrium-based learning can support high-dimensional embodied locomotion and provide an algorithmic foundation for low-power on-robot adaptation and fine-tuning.
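For readers unfamiliar with the ratio clipping that the abstract builds on, the following is a minimal NumPy sketch of the standard PPO clipped surrogate and its per-sample gradient with respect to the new log-probability. It is not the authors' EP-compatible nudging signal or their two-sided clipping mechanism, neither of which is specified in the abstract. The function names, the `eps=0.2` default, and the toy inputs are assumptions made for illustration only.

```python
import numpy as np

def ppo_clip_surrogate(logp_new, logp_old, advantages, eps=0.2):
    """Standard PPO clipped surrogate, averaged over samples (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)            # probability ratio pi_new / pi_old
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.minimum(unclipped, clipped).mean()

def ppo_clip_grad_wrt_logp(logp_new, logp_old, advantages, eps=0.2):
    """Per-sample gradient of the clipped surrogate with respect to logp_new.

    The gradient equals ratio * advantage while the unclipped term is the
    active one, and is zero once clipping takes over.
    """
    ratio = np.exp(logp_new - logp_old)
    # Positive advantage: clipping activates above 1 + eps.
    # Negative advantage: clipping activates below 1 - eps.
    active = np.where(advantages >= 0, ratio <= 1.0 + eps, ratio >= 1.0 - eps)
    return np.where(active, ratio * advantages, 0.0)

# Toy inputs (hypothetical): ratios of 1.5, 0.5, and 1.0.
logp_old = np.zeros(3)
logp_new = np.log(np.array([1.5, 0.5, 1.0]))
advantages = np.array([1.0, 1.0, -1.0])

surrogate = ppo_clip_surrogate(logp_new, logp_old, advantages)
grad = ppo_clip_grad_wrt_logp(logp_new, logp_old, advantages)
```

In this toy case the first sample has a ratio above `1 + eps` with a positive advantage, so its gradient is zero. A signal of this kind, defined only at the policy output, is the sort of quantity that an output-nudging scheme would need to supply in place of a backpropagated loss gradient, although the paper's exact formulation may differ.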