Efficient Federated RLHF via Zeroth-Order Policy Optimization

📅 2026-04-19
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work proposes Par-S²ZPO, a federated reinforcement learning from human feedback (RLHF) algorithm built on zeroth-order optimization with binary perturbations and designed for resource-constrained agents such as edge devices. By design, the method keeps communication, computation, and memory complexity low. Theoretical analysis gives an upper bound on its convergence rate, showing that it matches its centralized counterpart in sample complexity while converging in fewer policy update iterations. In experiments on four MuJoCo tasks, Par-S²ZPO outperforms a FedAvg-based federated RLHF baseline.

📝 Abstract
This paper considers reinforcement learning from human feedback in a federated learning setting with resource-constrained agents, such as edge devices. We propose an efficient federated RLHF algorithm, named Partitioned, Sign-based Stochastic Zeroth-order Policy Optimization (Par-S$^2$ZPO). The algorithm is built on zeroth-order optimization with binary perturbation, resulting in low communication, computation, and memory complexity by design. Our theoretical analysis establishes an upper bound on the convergence rate of Par-S$^2$ZPO, revealing that it is as efficient as its centralized counterpart in terms of sample complexity but converges faster in terms of policy update iterations. Our experimental results show that it outperforms a FedAvg-based RLHF baseline on four MuJoCo RL tasks.
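
To make "zeroth-order optimization with binary perturbation" concrete, here is a minimal sketch of a generic sign-based zeroth-order update. It is not the authors' Par-S$^2$ZPO: it has no partitioning, no federated aggregation, and no human feedback. The names `rademacher`, `zo_sign_step`, `toy_objective`, `mu`, and `lr` are hypothetical, and the quadratic objective is only an assumed stand-in for a policy's return.

```python
import random


def rademacher(dim, rng):
    """Binary perturbation: each coordinate is +1 or -1 with equal probability."""
    return [rng.choice((-1.0, 1.0)) for _ in range(dim)]


def zo_sign_step(theta, objective, rng, mu=1e-2, lr=1e-2):
    """One sign-based zeroth-order ascent step (illustrative, not Par-S^2ZPO).

    Evaluates the objective at two perturbed points and uses only the sign
    of the difference, so no gradient is ever computed or stored.
    """
    z = rademacher(len(theta), rng)
    plus = [t + mu * zi for t, zi in zip(theta, z)]
    minus = [t - mu * zi for t, zi in zip(theta, z)]
    diff = objective(plus) - objective(minus)
    sign = (diff > 0) - (diff < 0)  # +1, 0, or -1
    # Move along the perturbation if it looked better, against it otherwise.
    return [t + lr * sign * zi for t, zi in zip(theta, z)]


def toy_objective(theta):
    """Assumed stand-in for a policy's return: maximized at theta = 0."""
    return -sum(t * t for t in theta)


rng = random.Random(0)
theta_start = [2.0, -3.0, 1.5]
theta = list(theta_start)
for _ in range(2000):
    theta = zo_sign_step(theta, toy_objective, rng)

print("start:", toy_objective(theta_start), "end:", toy_objective(theta))
```

The update consumes only the sign of a comparison between two perturbed parameter vectors, which is the kind of signal a pairwise preference could supply. Whether and how Par-S$^2$ZPO exploits this for communication savings is not stated in the abstract, so treat that connection as an assumption.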

Problem

Research questions and friction points this paper is trying to address.

Federated RLHF
resource-constrained agents
edge devices
human feedback
reinforcement learning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Federated RLHF
Zeroth-order optimization
Binary perturbation
Communication efficiency
Policy optimization