SMAC-R1: The Emergence of Intelligence in Decision-Making Tasks

📅 2024-10-21
📈 Citations: 1
Influential: 0
🤖 AI Summary
To address poor policy interpretability, high environment interaction costs, and weak cross-task transferability in multi-agent reinforcement learning (MARL), this paper proposes an LLM-driven framework for interpretable decision tree generation and knowledge distillation. Methodologically, it introduces the first LLM-based self-generation and self-reflection mechanism for decision trees; establishes a closed-loop training paradigm comprising behavioral cloning, script augmentation, small-model distillation, and GRPO-based reinforcement; and employs DeepSeek-Coder-v2.5-236B to generate executable decision tree code, while Qwen2.5-7B-Base is fine-tuned via supervised fine-tuning (SFT) and GRPO. Evaluated on all 23 SMAC benchmarks and 10 novel tasks, the framework produces high-quality, executable decision trees with minimal environment interactions. It achieves zero-shot transfer to isomorphic environments, yielding an average win-rate improvement of 27.6%, thereby simultaneously advancing interpretability, sample efficiency, and generalization.
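The decision trees the pipeline generates are ordinary executable scripts rather than learned network weights, which is what makes them interpretable. A minimal sketch of what such a generated policy might look like for a SMAC-style unit (all field names, thresholds, and action strings here are illustrative assumptions, not taken from the paper's actual scripts):

```python
# Hypothetical sketch of an LLM-generated decision-tree policy for one
# SMAC-style unit; names and thresholds are illustrative only.
def decide_action(unit, allies, enemies):
    """Return an action chosen by a hand-readable decision tree."""
    if unit["health"] < 0.3 * unit["max_health"]:
        return "retreat"                      # low health: fall back
    in_range = [e for e in enemies if e["distance"] <= unit["attack_range"]]
    if in_range:
        target = min(in_range, key=lambda e: e["health"])
        return f"attack:{target['id']}"       # focus-fire the weakest enemy
    return "advance"                          # otherwise close the distance
```

Because every branch is explicit source code, a script like this can be read, audited, and reused on isomorphic maps without retraining, which is the transfer property the paper measures.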

📝 Abstract
The StarCraft Multi-Agent Challenge (SMAC) has been one of the most commonly used experimental environments in multi-agent reinforcement learning (MARL), where the specific task is to control a set number of allied units to defeat enemy forces. Traditional MARL algorithms often require interacting with the environment for millions of steps to train a parametric model, and the resulting policies are typically non-interpretable and transfer poorly. In this paper, we introduce SMAC-R1, which is based on the Qwen2.5-7B-Base LLM distilled from DeepSeek-Coder-v2.5-236B. Analogous to performing online reinforcement learning after behavior cloning in an offline learning process, in our pipeline agents leverage the DeepSeek LLM to generate decision tree code from task descriptions, then self-reflect on that code using reward feedback from the environment. Based on this, we augment the generated scripts to fine-tune a small LLM, Qwen2.5-7B-Base, distilling the decision-making ability via Supervised Fine-Tuning (SFT) and enhancing script generation with the Group Relative Policy Optimization (GRPO) algorithm. We conduct experiments on the original 23 SMAC tasks and 10 newly designed tasks to demonstrate that our method can produce high-quality, interpretable decision trees with minimal environmental exploration. Moreover, these scripts exhibit strong transferability, applying successfully to homogeneous SMAC environments without modification. We believe this approach offers a new direction for solving decision-making tasks and for domain-specific LLM training pipelines.
Problem

Research questions and friction points this paper is trying to address.

Enhance decision-making in multi-agent reinforcement learning tasks.
Generate interpretable and transferable decision trees using LLMs.
Reduce environmental exploration needed for effective policy training.
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM generates decision tree code
Self-reflection using environmental feedback
Fine-tuning LLM with GRPO algorithm
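The first two innovations above form a generate-evaluate-reflect loop: the LLM proposes a decision-tree script, the environment scores it (e.g. by win rate over rollouts), and the score is fed back into the prompt so the LLM can revise its own code. A minimal sketch of that loop, with the LLM call and environment stubbed out and all names hypothetical:

```python
# Illustrative sketch of the self-reflection loop; `llm` and `env` are
# stand-in callables, and all parameter names are hypothetical.
def improve_script(llm, env, task_description, max_rounds=5, target=0.9):
    """Iteratively generate and revise a decision-tree script."""
    prompt = task_description
    best_script, best_reward = None, float("-inf")
    for _ in range(max_rounds):
        script = llm(prompt)          # LLM emits decision-tree code
        reward = env(script)          # e.g. win rate from env rollouts
        if reward > best_reward:
            best_script, best_reward = script, reward
        if reward >= target:          # good enough: stop early
            break
        # feed the reward back so the LLM can revise its own script
        prompt = (task_description
                  + f"\nPrevious script:\n{script}\nReward: {reward:.2f}\n"
                  + "Revise the decision tree to improve the reward.")
    return best_script, best_reward
```

The scripts collected this way, after augmentation, become the SFT corpus for the small model, and GRPO then sharpens its generation ability using the same environment reward as the group-relative signal.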
Yue Deng
College of Computer Science and Technology, Zhejiang University
Weiyu Ma
KAUST
Yuxin Fan
University of Pennsylvania
Yin Zhang
College of Computer Science and Technology, Zhejiang University
Haifeng Zhang
Institute of Automation, Chinese Academy of Sciences
Jian Zhao
Polixir.ai