🤖 AI Summary
This work studies the robustness of offline multi-agent reinforcement learning from human feedback (MARLHF) to adversarial corruption of preference data, within the linear Markov game framework. Assuming that an $\varepsilon$-fraction of trajectory-preference samples may be arbitrarily corrupted, the paper gives three results. Under a uniform coverage assumption, a robust estimator achieves a Nash equilibrium gap of $O(\varepsilon^{1-o(1)})$. Under the weaker unilateral coverage condition, a second algorithm achieves a Nash gap of $O(\sqrt{\varepsilon})$. Because both procedures are computationally intractable, the paper relaxes the solution concept to coarse correlated equilibria (CCE) and, under the same unilateral coverage condition, derives a quasi-polynomial-time algorithm whose CCE gap is $O(\sqrt{\varepsilon})$. The authors state that, to the best of their knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF; the results also show how the achievable equilibrium guarantee depends on the data coverage condition.
📝 Abstract
We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong-contamination model: given a dataset $D$ of trajectory-preference tuples (each preference being an $n$-dimensional binary label vector representing each of the $n$ agents' preferences), an $\varepsilon$-fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a uniform coverage assumption, where every policy of interest is sufficiently represented in the clean (prior to corruption) data, we introduce a robust estimator that guarantees an $O(\varepsilon^{1 - o(1)})$ bound on the Nash equilibrium gap. Next, we move to the more challenging unilateral coverage setting, in which only a Nash equilibrium and its single-player deviations are covered. In this case, our proposed algorithm achieves an $O(\sqrt{\varepsilon})$ bound on the Nash gap. Both of these procedures, however, suffer from intractable computation. To address this, we relax our solution concept to coarse correlated equilibria (CCE). Under the same unilateral coverage regime, we derive a quasi-polynomial-time algorithm whose CCE gap scales as $O(\sqrt{\varepsilon})$. To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.
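To make the data model concrete, here is a minimal Python sketch of a preference dataset with one binary label per agent, followed by corruption of an $\varepsilon$-fraction of its samples. It is an illustration under stated assumptions, not the paper's construction. The logistic (Bradley-Terry-style) label model, the Gaussian trajectory features, and the names `make_clean_dataset` and `corrupt` are all hypothetical. The adversary shown only flips labels on randomly chosen samples, whereas the strong-contamination model lets it inspect the clean data and replace the chosen samples arbitrarily.

```python
import numpy as np

def make_clean_dataset(m, n_agents, d, rng):
    """Draw m tuples (phi_a, phi_b, labels) with one binary label per agent.

    Assumption: each agent's reward is linear in d-dimensional trajectory
    features, and labels follow a logistic preference model.
    """
    theta = rng.normal(size=(n_agents, d))   # hypothetical per-agent reward parameters
    phi_a = rng.normal(size=(m, d))          # features of the first trajectory
    phi_b = rng.normal(size=(m, d))          # features of the second trajectory
    logits = (phi_a - phi_b) @ theta.T       # shape (m, n_agents)
    p_prefers_a = 1.0 / (1.0 + np.exp(-logits))
    labels = (rng.random((m, n_agents)) < p_prefers_a).astype(int)
    return phi_a, phi_b, labels

def corrupt(labels, eps, rng):
    """Flip every agent's label on floor(eps * m) samples.

    Assumption: samples are picked at random for simplicity; a real
    adversary may pick them adaptively and may also alter trajectories.
    """
    m = labels.shape[0]
    k = int(np.floor(eps * m))
    idx = rng.choice(m, size=k, replace=False)
    corrupted = labels.copy()
    corrupted[idx] = 1 - corrupted[idx]
    return corrupted, idx
```

A learner in this setting sees only the corrupted labels and does not know which indices were modified, which is why the guarantees above are stated in terms of $\varepsilon$ rather than the identity of the corrupted samples.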