Efficient Federated RLHF via Zeroth-Order Policy Optimization

📅 2026-04-19

🤖 AI Summary

This work proposes Par-S²ZPO (Partitioned, Sign-based Stochastic Zeroth-order Policy Optimization), a federated reinforcement learning from human feedback (RLHF) algorithm built on zeroth-order optimization with binary perturbations. It is designed for resource-constrained agents such as edge devices and keeps communication, computation, and memory costs low by design. The theoretical analysis gives an upper bound on its convergence rate. The bound shows that Par-S²ZPO matches its centralized counterpart in sample complexity while converging faster in terms of policy update iterations. On four MuJoCo RL tasks, Par-S²ZPO outperforms a FedAvg-based federated RLHF baseline.

📝 Abstract

This paper considers reinforcement learning from human feedback in a federated learning setting with resource-constrained agents, such as edge devices. We propose an efficient federated RLHF algorithm, named Partitioned, Sign-based Stochastic Zeroth-order Policy Optimization (Par-S²ZPO). The algorithm is built on zeroth-order optimization with binary perturbation, resulting in low communication, computation, and memory complexity by design. Our theoretical analysis establishes an upper bound on the convergence rate of Par-S²ZPO, revealing that it is as efficient as its centralized counterpart in terms of sample complexity but converges faster in terms of policy update iterations. Our experimental results show that it outperforms a FedAvg-based RLHF baseline on four MuJoCo RL tasks.
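The abstract gives no pseudocode. The following is only a minimal sketch of the general idea under our own assumptions, not Par-S²ZPO itself. It implements a generic comparison-based (sign) zeroth-order step with Rademacher (±1) perturbations. A simulated preference oracle stands in for human feedback, and every agent shares that oracle, which is a simplification. Shared random seeds let each agent upload a single bit per round. The partitioning step and the paper's actual update rule are not modeled. All names (`binary_perturbation`, `agent_report`, `server_step`, `run`) and hyperparameters (`mu`, `lr`, `num_agents`, `rounds`) are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical preference oracle: True if policy `a` is preferred over policy `b`.
PrefOracle = Callable[[Sequence[float], Sequence[float]], bool]


def binary_perturbation(dim: int, seed: int) -> Vector:
    """Rademacher perturbation: each coordinate is +1 or -1 with equal probability.

    It is regenerated from `seed`, so it never has to be sent or stored."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def agent_report(theta: Sequence[float], seed: int, mu: float, prefers: PrefOracle) -> int:
    """One agent's uplink message: a single bit, encoded as +1 or -1.

    The agent forms theta + mu*u and theta - mu*u and asks its (simulated)
    human which of the two perturbed policies is better."""
    u = binary_perturbation(len(theta), seed)
    plus = [t + mu * ui for t, ui in zip(theta, u)]
    minus = [t - mu * ui for t, ui in zip(theta, u)]
    return 1 if prefers(plus, minus) else -1


def server_step(theta: Sequence[float], seeds: Sequence[int],
                bits: Sequence[int], lr: float) -> Vector:
    """Move theta along the average of the sign-weighted perturbations.

    The server rebuilds each agent's perturbation from the shared seed."""
    step = [0.0] * len(theta)
    for seed, bit in zip(seeds, bits):
        u = binary_perturbation(len(theta), seed)
        for j, uj in enumerate(u):
            step[j] += bit * uj / len(bits)
    return [t + lr * s for t, s in zip(theta, step)]


def run(target: Sequence[float], num_agents: int = 8, rounds: int = 400,
        mu: float = 0.01, lr: float = 0.05) -> Vector:
    """Toy federated loop; `target` defines a hidden reward used by the oracle."""

    def reward(p: Sequence[float]) -> float:
        # Stand-in for human judgement: closer to the hidden target is better.
        return -sum((pi - ti) ** 2 for pi, ti in zip(p, target))

    def prefers(a: Sequence[float], b: Sequence[float]) -> bool:
        return reward(a) > reward(b)

    theta: Vector = [0.0] * len(target)
    for r in range(rounds):
        # Agent i in round r uses seed r * num_agents + i (distinct, known to the server).
        seeds = [r * num_agents + i for i in range(num_agents)]
        bits = [agent_report(theta, s, mu, prefers) for s in seeds]
        theta = server_step(theta, seeds, bits, lr)
    return theta
```

Calling `run([1.0, -2.0, 0.5, 3.0, -1.0])` should drive the squared distance to the hidden target well below its starting value of 15.25. In this toy setup, agents never upload gradients or model deltas, only a comparison outcome. The ±1 perturbation is regenerated from a seed rather than stored, which is one way binary perturbations can keep uplink communication and memory small.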

Problem

Research questions and friction points this paper is trying to address.

Federated RLHF

resource-constrained agents

edge devices

human feedback

reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated RLHF

Zeroth-order optimization

Binary perturbation