Distributed Zeroth-Order Policy Gradient for Networked Multi-agent Reinforcement Learning from Human Feedback

📅 2026-05-15

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of networked multi-agent reinforcement learning—namely, reliance on centralized training, poor scalability, and the absence of explicit rewards—by proposing the first fully decentralized framework for learning from human preference feedback. The approach introduces a local human feedback mechanism based on spatiotemporally truncated trajectories and devises a zeroth-order policy gradient algorithm, enabling each agent to collaboratively optimize its policy using only state-action information and preference signals within its κ-hop neighborhood. Theoretical analysis establishes that the algorithm converges to an ε-stationary point with polynomial sample complexity. Empirical evaluations in GridWorld and predator-prey environments demonstrate the method’s effectiveness, scalability, and ability to operate without access to ground-truth reward signals.
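The $\kappa$-hop neighborhood mentioned above can be illustrated with a breadth-first search over the agent network. This is a minimal sketch, not the authors' code: the adjacency-dictionary representation, the function name `k_hop_neighborhood`, and the five-agent line graph are all assumptions made for illustration.

```python
from collections import deque


def k_hop_neighborhood(adjacency, agent, kappa):
    """Return the set of agents within `kappa` hops of `agent`, including itself.

    `adjacency` is a hypothetical dict mapping each agent id to a list of
    its direct neighbors in the communication graph.
    """
    visited = {agent}
    frontier = deque([(agent, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == kappa:
            # Do not expand beyond the truncation radius.
            continue
        for nbr in adjacency[node]:
            if nbr not in visited:
                visited.add(nbr)
                frontier.append((nbr, dist + 1))
    return visited


# Hypothetical 5-agent line graph: 0 - 1 - 2 - 3 - 4
line_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
```

With `kappa = 1`, agent 2 in this toy graph would only use information from agents 1, 2, and 3, which is the sense in which the feedback is spatially truncated.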

📝 Abstract

We study a networked multi-agent reinforcement learning (NMARL) problem with human feedback in an infinite-horizon setting, where agents interact over an underlying network with localized state dependencies and aim to collaboratively maximize the average discounted return. Existing approaches with preference feedback are primarily developed for single-agent settings and rely on centralized training, which limits their scalability and applicability to large-scale networked multi-agent systems. To address this, we introduce a novel human feedback mechanism based on spatiotemporally truncated trajectories, defined as $H$-horizon trajectory pairs aggregated over each agent's $\kappa$-hop neighborhood. Building on this, we develop a distributed zeroth-order policy gradient algorithm, where each agent estimates its local policy gradient using human preference feedback generated from both the current joint policy and a perturbed joint policy drawn from a zero-mean Gaussian distribution. Specifically, the algorithm is fully distributed, as the feedback received by each agent depends solely on the state-action information within its $\kappa$-hop neighborhood and does not require explicit reward signals or centralized control. We further rigorously establish that the proposed algorithm converges to an $\epsilon$-stationary point with polynomial sample complexity. Finally, simulation results in a stochastic GridWorld environment and a predator-prey environment demonstrate the effectiveness and scalability of the proposed algorithm in achieving collaborative optimization based solely on human preference feedback.
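The zeroth-order idea in the abstract can be sketched for a single agent as follows. This is an illustrative sketch under stated assumptions, not the paper's algorithm: the Bradley-Terry preference model, the simulated human oracle `simulated_preference`, the toy objective `toy_local_value`, the clipping constants, and all parameter names (`mu`, `num_queries`) are assumptions introduced here. The abstract only states that each agent compares the current policy with a Gaussian-perturbed one using preference feedback.

```python
import math
import random


def simulated_preference(value_a, value_b, rng):
    """Hypothetical human oracle: returns 1 if A is preferred over B.

    Assumes a Bradley-Terry model on the (hidden) value difference.
    """
    # Clamp the difference so math.exp cannot overflow.
    diff = max(min(value_a - value_b, 50.0), -50.0)
    p_a = 1.0 / (1.0 + math.exp(-diff))
    return 1 if rng.random() < p_a else 0


def zeroth_order_preference_gradient(theta, local_value, mu, num_queries, rng):
    """One zeroth-order gradient estimate built from pairwise preferences only."""
    # Zero-mean Gaussian perturbation direction for the policy parameters.
    u = [rng.gauss(0.0, 1.0) for _ in theta]
    theta_perturbed = [t + mu * ui for t, ui in zip(theta, u)]
    # The values stay hidden from the learner; only the oracle sees them.
    v_pert = local_value(theta_perturbed)
    v_cur = local_value(theta)
    wins = sum(simulated_preference(v_pert, v_cur, rng) for _ in range(num_queries))
    # Clip the empirical win rate away from 0 and 1 so the logit stays finite.
    p_hat = min(max(wins / num_queries, 1e-3), 1.0 - 1e-3)
    # Invert the assumed Bradley-Terry link to estimate the value difference.
    value_diff = math.log(p_hat / (1.0 - p_hat))
    # Finite-difference style estimate along the sampled direction.
    return [value_diff / mu * ui for ui in u]


def toy_local_value(theta):
    """Hypothetical stand-in for a local return, maximized at theta = (1, ..., 1)."""
    return -sum((t - 1.0) ** 2 for t in theta)
```

In a networked setting, `local_value` would be replaced by human comparisons of $H$-horizon trajectory pairs restricted to the agent's $\kappa$-hop neighborhood, and the estimate would feed a gradient ascent step on that agent's parameters. The toy objective here only serves to make the sketch runnable.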

Problem

Research questions and friction points this paper is trying to address.

networked multi-agent reinforcement learning

human feedback

distributed optimization

zeroth-order policy gradient

scalability

Innovation

Methods, ideas, or system contributions that make the work stand out.

distributed reinforcement learning

zeroth-order optimization

human feedback