🤖 AI Summary
To address the low policy-learning efficiency of multi-agent reinforcement learning (MARL) in cooperative games and the high computational cost of online large language model (LLM) inference, this paper proposes a "single LLM call" paradigm. It leverages an LLM offline, once per environment, to jointly perform high-level task planning, state-abstraction parsing, and generation of a planning function, which then guides the training of lightweight decentralized policy networks (based on MAPPO or VDN). Crucially, the LLM is excluded from online decision-making, eliminating runtime inference overhead. Evaluated across three cooperative benchmark environments, the approach improves policy performance by 12%–37% over state-of-the-art MARL methods, reduces LLM calls during training by over 99%, and requires zero LLM involvement during inference, achieving both high efficiency and deployment friendliness.
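The "single LLM call" idea above can be sketched in a few lines: the LLM is queried once, offline, to produce the source of a planning function; that source is compiled and then reused at every training step for reward shaping, with no further LLM calls. This is a hypothetical illustration, not the paper's implementation; the names `query_llm_once`, `planning_fn`, and `shaped_reward`, the toy key-door task, and the 0.1 goal-alignment bonus are all assumptions for the sketch.

```python
def query_llm_once(task_description: str) -> str:
    """Stand-in for the single offline LLM call. A real system would query an
    LLM API here; this sketch returns a canned planning function for a toy
    two-agent key-door task."""
    return (
        "def planning_fn(state):\n"
        "    # Assign each agent a high-level goal from an abstract state:\n"
        "    # state = (agent0_has_key, agent1_at_door)\n"
        "    has_key, at_door = state\n"
        "    return ['fetch_key' if not has_key else 'open_door',\n"
        "            'guard_door' if not at_door else 'open_door']\n"
    )

def compile_planning_fn(source: str):
    """Compile the LLM-generated source once; the resulting function is then
    called locally during training, with zero further LLM involvement."""
    namespace = {}
    exec(source, namespace)  # generated code would be reviewed offline first
    return namespace["planning_fn"]

def shaped_reward(env_reward: float, actions, goals) -> float:
    """Reward shaping: a small bonus for each agent whose action matches its
    assigned high-level goal, added on top of the environment reward."""
    bonus = sum(0.1 for a, g in zip(actions, goals) if a == g)
    return env_reward + bonus

# One offline call before training; the MARL loop below never touches the LLM.
planning_fn = compile_planning_fn(query_llm_once("cooperative key-door task"))
goals = planning_fn((False, True))
reward = shaped_reward(0.0, ["fetch_key", "open_door"], goals)
```

In a full pipeline, `shaped_reward` would replace the raw environment reward inside an off-the-shelf MAPPO or VDN update, so the trained policies depend only on lightweight networks at deployment time.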
📝 Abstract
Advancements in deep multi-agent reinforcement learning (MARL) have positioned it as a promising approach for decision-making in cooperative games. However, it remains challenging for MARL agents to learn cooperative strategies in some game environments. Recently, large language models (LLMs) have demonstrated emergent reasoning capabilities, making them promising candidates for enhancing coordination among agents. However, because of their model size, frequently querying LLMs for agent actions can be expensive. In this work, we propose You Only LLM Once for MARL (YOLO-MARL), a novel framework that leverages the high-level task-planning capabilities of LLMs to improve policy learning for multiple agents in cooperative games. Notably, for each game environment, YOLO-MARL requires only a one-time interaction with the LLM, through the proposed strategy generation, state interpretation, and planning function generation modules, before the MARL policy training process. This avoids the ongoing cost and computation time of frequent LLM API calls during training. Moreover, the trained decentralized, normal-sized neural-network policies operate independently of the LLM. We evaluate our method across three different environments and demonstrate that YOLO-MARL outperforms traditional MARL algorithms.