🤖 AI Summary
This work proposes Par-S²ZPO, the first federated reinforcement learning from human feedback (RLHF) algorithm that integrates zeroth-order optimization with binary perturbations, designed for resource-constrained agents such as edge devices. By construction, the method keeps communication, computation, and memory costs low. The theoretical analysis gives an upper bound on the convergence rate, showing that Par-S²ZPO matches centralized RLHF in sample complexity while converging in fewer policy update iterations. Experiments on four MuJoCo tasks show that Par-S²ZPO outperforms a FedAvg-based federated RLHF baseline in both performance and efficiency.
📝 Abstract
This paper considers reinforcement learning from human feedback (RLHF) in a federated learning setting with resource-constrained agents, such as edge devices. We propose an efficient federated RLHF algorithm, named Partitioned, Sign-based Stochastic Zeroth-order Policy Optimization (Par-S$^2$ZPO). The algorithm is built on zeroth-order optimization with binary perturbation, resulting in low communication, computation, and memory complexity by design. Our theoretical analysis establishes an upper bound on the convergence rate of Par-S$^2$ZPO, revealing that it is as efficient as its centralized counterpart in terms of sample complexity but converges faster in terms of policy update iterations. Our experimental results show that it outperforms a FedAvg-based RLHF baseline on four MuJoCo RL tasks.
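To make the phrase "zeroth-order optimization with binary perturbation" concrete, the following is a minimal sketch of a generic sign-based zeroth-order step. It is not the Par-S$^2$ZPO algorithm itself: the partitioning and federated aggregation are omitted, and the names `sign_zo_step`, `prefer`, `mu`, and `lr`, as well as the quadratic toy reward standing in for human preference feedback, are assumptions made only for illustration.

```python
import numpy as np

def sign_zo_step(theta, prefer, rng, mu=0.05, lr=0.02):
    """One sign-based zeroth-order step (illustrative, not the paper's method).

    theta  : current parameter vector
    prefer : hypothetical oracle, prefer(a, b) -> True if a is preferred to b
    mu     : perturbation radius (assumed value)
    lr     : step size (assumed value)
    """
    # Binary (Rademacher) perturbation: every entry is +1 or -1.
    z = rng.choice([-1.0, 1.0], size=theta.shape)
    # Only a one-bit comparison of the two perturbed candidates is needed;
    # no gradient is computed or stored.
    s = 1.0 if prefer(theta + mu * z, theta - mu * z) else -1.0
    # Move along the perturbation in the preferred direction.
    return theta + lr * s * z

def run_demo(seed=0, dim=10, steps=2000):
    """Toy run: the preference oracle compares a quadratic reward (assumption)."""
    rng = np.random.default_rng(seed)
    target = np.ones(dim)

    def reward(x):
        return -np.sum((x - target) ** 2)

    def prefer(a, b):
        return reward(a) > reward(b)

    theta = np.zeros(dim)
    initial_dist = np.linalg.norm(theta - target)
    for _ in range(steps):
        theta = sign_zo_step(theta, prefer, rng)
    final_dist = np.linalg.norm(theta - target)
    return initial_dist, final_dist

if __name__ == "__main__":
    d0, d1 = run_demo()
    print(f"distance to target: {d0:.3f} -> {d1:.3f}")
```

In this sketch each step uses only a binary perturbation vector and a single comparison bit, which suggests why such methods can be light on memory and communication; how Par-S$^2$ZPO partitions and aggregates this information across agents is specified in the paper, not here.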