O-MAPL: Offline Multi-agent Preference Learning

📅 2025-01-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
Inferring implicit reward functions from human preferences in offline multi-agent settings remains challenging; existing two-stage approaches—first learning a reward function, then optimizing policies—suffer from error propagation and training instability. Method: We propose the first end-to-end offline multi-agent preference learning framework. It extends the theoretical connection between soft Q-functions and reward functions to multi-agent settings and integrates a provably sound multi-agent value decomposition mechanism, enabling joint preference modeling and cooperative policy optimization. Our approach adapts key ideas from preference learning, soft Q-learning, and centralized training with decentralized execution (e.g., VDN/QMIX) to the offline setting. Contribution/Results: Evaluated on the SMAC and MAMuJoCo benchmarks, our method outperforms state-of-the-art baselines, achieving up to a 2.3× improvement in sample efficiency while enhancing both policy coordination and training stability.
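The soft Q-function/reward connection mentioned above can be made concrete: under the inverse soft Bellman operator, a reward is implied by a Q-function as r(s, a) = Q(s, a) − γV(s′) with V(s) = α·log Σₐ exp(Q(s, a)/α), and a Bradley-Terry loss over preferred/rejected segments can then be trained directly on Q, skipping the separate reward-learning stage. Below is a minimal tabular sketch of that idea, not the paper's implementation; all function names and the small Q-table are hypothetical.

```python
import numpy as np

def soft_value(q_row, alpha=1.0):
    """Soft state value V(s) = alpha * log sum_a exp(Q(s, a) / alpha)."""
    z = np.asarray(q_row, dtype=float) / alpha
    m = z.max()  # max-shift for numerical stability
    return alpha * (m + np.log(np.exp(z - m).sum()))

def implied_reward(Q, s, a, s_next, gamma=0.99, alpha=1.0):
    """Inverse soft Bellman operator: r(s, a) = Q(s, a) - gamma * V(s')."""
    return Q[s, a] - gamma * soft_value(Q[s_next], alpha)

def bt_preference_loss(Q, seg_a, seg_b, pref, gamma=0.99, alpha=1.0):
    """Bradley-Terry cross-entropy on returns implied by Q.

    seg_a, seg_b: lists of (s, a, s_next) transitions; pref = 1 means
    segment A was preferred by the human, 0 means segment B was.
    """
    ret_a = sum(implied_reward(Q, s, a, sn, gamma, alpha) for s, a, sn in seg_a)
    ret_b = sum(implied_reward(Q, s, a, sn, gamma, alpha) for s, a, sn in seg_b)
    p_a = 1.0 / (1.0 + np.exp(ret_b - ret_a))  # P(A preferred | Q)
    return -(pref * np.log(p_a) + (1 - pref) * np.log(1.0 - p_a))

# Toy check: labeling the higher-implied-return segment as preferred
# should yield a lower loss than the opposite label.
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
good, bad = [(0, 0, 1)], [(0, 1, 1)]
assert bt_preference_loss(Q, good, bad, 1) < bt_preference_loss(Q, good, bad, 0)
```

Because the loss is written purely in terms of Q, its gradient updates Q (and hence the soft policy) directly from preference labels, which is the single-stage structure the summary contrasts with two-stage reward-then-policy pipelines.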

📝 Abstract
Inferring reward functions from demonstrations is a key challenge in reinforcement learning (RL), particularly in multi-agent RL (MARL), where large joint state-action spaces and complex inter-agent interactions complicate the task. While prior single-agent studies have explored recovering reward functions and policies from human preferences, similar work in MARL is limited. Existing methods often involve separate stages of supervised reward learning and MARL algorithms, leading to unstable training. In this work, we introduce a novel end-to-end preference-based learning framework for cooperative MARL, leveraging the underlying connection between reward functions and soft Q-functions. Our approach uses a carefully designed multi-agent value decomposition strategy to improve training efficiency. Extensive experiments on the SMAC and MAMuJoCo benchmarks show that our algorithm outperforms existing methods across various tasks.
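The multi-agent value decomposition the abstract refers to builds on the VDN/QMIX family: a joint Q-value is mixed from per-agent utilities so that each agent's greedy action recovers the joint greedy action (the IGM property), enabling decentralized execution. A minimal additive (VDN-style) sketch, with hypothetical names and toy Q-tables, not the paper's mechanism:

```python
import numpy as np

def vdn_joint_q(per_agent_qs, joint_action):
    """Additive mixing: Q_tot(s, a_1..a_n) = sum_i Q_i(o_i, a_i)."""
    return sum(q[a] for q, a in zip(per_agent_qs, joint_action))

def decentralized_argmax(per_agent_qs):
    """Each agent maximizes its own utility independently. Because the
    mix is monotone (a sum), this coincides with the joint argmax of
    Q_tot -- the Individual-Global-Max (IGM) consistency condition."""
    return tuple(int(np.argmax(q)) for q in per_agent_qs)

# Toy check: brute-force joint argmax equals the per-agent argmaxes.
q1 = np.array([0.2, 1.5, -0.3])  # agent 1 utilities over 3 actions
q2 = np.array([0.7, 0.1])        # agent 2 utilities over 2 actions
joint = {(a1, a2): vdn_joint_q([q1, q2], (a1, a2))
         for a1 in range(3) for a2 in range(2)}
assert max(joint, key=joint.get) == decentralized_argmax([q1, q2])
```

QMIX generalizes this by replacing the sum with a learned monotone mixing network; either choice preserves the IGM property that makes centralized training compatible with decentralized execution.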
Problem

Research questions and friction points this paper is trying to address.

Multi-Agent Learning
Reward Inference
Offline Environment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-agent Reinforcement Learning
Reward Learning
Soft Q-learning