DLM: Unified Decision Language Models for Offline Multi-Agent Sequential Decision Making

📅 2026-04-26
📈 Citations: 0
Influential: 0
📄 PDF

career value

187K/year
🤖 AI Summary
This work addresses the limited policy generalization and difficulties in handling heterogeneous observation and action spaces in offline multi-agent reinforcement learning (MARL). The authors propose modeling multi-agent decision-making as a conversational sequence prediction task, introducing a language model as a unified interface within a centralized training and decentralized execution framework. This approach naturally accommodates heterogeneous inputs and enables zero-shot generalization across tasks. The method combines supervised fine-tuning with group relative policy optimization, trained on conversational datasets, and incorporates a lightweight reward function to enhance robustness to out-of-distribution actions. Experiments demonstrate that the proposed approach significantly outperforms existing offline MARL and LLM-based decision-making methods across multiple benchmarks, exhibiting strong zero-shot transfer and task generalization capabilities.

Technology Category

Application Category

📝 Abstract
Building scalable and reusable multi-agent decision policies from offline datasets remains a challenge in offline multi-agent reinforcement learning (MARL), as existing methods often rely on fixed observation formats and action spaces that limit generalization. In contrast, large language models (LLMs) offer a flexible modeling interface that can naturally accommodate heterogeneous observations and actions. Motivated by this, we propose the Decision Language Model (DLM), which formulates multi-agent decision making as a dialogue-style sequence prediction problem under the centralized training with decentralized execution paradigm. DLM is trained in two stages: a supervised fine-tuning phase, which leverages dialogue-style datasets for centralized training with inter-agent context and generates executable actions from offline trajectories, followed by a group relative policy optimization phase to enhance robustness to out-of-distribution actions through lightweight reward functions. Experiments on multiple benchmarks show that a unified DLM outperforms strong offline MARL baselines and LLM-based conversational decision-making methods, while demonstrating strong zero-shot generalization to unseen scenarios across tasks.
Problem

Research questions and friction points this paper is trying to address.

offline multi-agent reinforcement learning
decision policies
generalization
heterogeneous observations
action spaces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decision Language Model
Offline Multi-Agent Reinforcement Learning
Dialogue-Style Sequence Prediction
Zero-Shot Generalization
Group Relative Policy Optimization
🔎 Similar Papers
2024-06-17Conference on Empirical Methods in Natural Language ProcessingCitations: 3