LLM-ALSO: LLM-Driven Adaptive Learning-Signal Optimization for Multi-Agent Reinforcement Learning

📅 2026-05-27

🤖 AI Summary

This work addresses inefficient coordination and learning in multi-agent reinforcement learning (MARL) under sparse rewards, where informative feedback is scarce and existing remedies often require extensive handcrafted design. The authors propose LLM-ALSO, a framework that integrates large language models (LLMs) into reward shaping through an iterative closed loop of stage-aware diagnosis, candidate generation, and short-horizon validation, so that only validated, behavior-evidence-driven reward-shaping updates reach the main training run. The LLM serves in two roles, as Critic and as Generator, and works from compact behavioral representations together with a branch-validation strategy. On sparse-reward cooperative MARL tasks, the authors report improved sparse-evaluation performance and learning efficiency, while reducing reliance on manual reward engineering.

📝 Abstract

Effective training-time guidance is central to multi-agent reinforcement learning (MARL), yet remains difficult in sparse-reward settings where weak supervision limits coordination and policy improvement, and existing methods often require substantial domain expertise or manual design effort. Large language models (LLMs) provide a promising alternative for flexible learning-signal design, yet existing LLM-based methods remain largely single-agent-oriented, one-shot, or weakly validated for the evolving training dynamics of cooperative MARL. To address these limitations, we propose LLM-ALSO, an iterative LLM-driven adaptive learning-signal optimization framework for MARL. Rather than directly deploying LLM-generated rewards, LLM-ALSO decomposes adaptation into iterative diagnosis, proposal, and validation: a Critic LLM diagnoses stage-specific learning and coordination failures from sparse-return metrics and compact behavior evidence, a Generator LLM proposes candidate reward-shaping configurations conditioned on the diagnosis, and branch-validation feedback refines candidates before they affect the main training trajectory. Through short-horizon validation and stage-aware adaptation, LLM-ALSO promotes only validated updates into training, reducing the risk of unreliable LLM-generated modifications. Experiments on sparse-reward cooperative MARL tasks show that LLM-ALSO improves sparse-evaluation performance and learning efficiency.
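The diagnose, propose, validate loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_critic`, `toy_generator`, and `toy_validate` are hypothetical rule-based stand-ins for the Critic LLM, the Generator LLM, and a short-horizon branch evaluation, and the `proximity_bonus` parameter and `mean_agent_distance` statistic are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A reward-shaping configuration is modelled as named coefficients (assumption).
ShapingConfig = Dict[str, float]
Feedback = List[Tuple[ShapingConfig, float]]


@dataclass
class Diagnosis:
    stage: str    # coarse training stage, e.g. "early" or "late"
    failure: str  # diagnosed coordination failure


def toy_critic(sparse_return: float, evidence: Dict[str, float]) -> Diagnosis:
    """Hypothetical stand-in for the Critic LLM (simple rules, no model call)."""
    stage = "early" if sparse_return < 0.2 else "late"
    far_apart = evidence.get("mean_agent_distance", 0.0) > 1.0
    return Diagnosis(stage, "agents_far_apart" if far_apart else "none")


def toy_generator(diagnosis: Diagnosis, feedback: Feedback) -> List[ShapingConfig]:
    """Hypothetical stand-in for the Generator LLM: propose candidates near
    the best configuration validated so far, conditioned on the diagnosis."""
    if diagnosis.failure == "none":
        return []
    center = max(feedback, key=lambda item: item[1])[0]["proximity_bonus"]
    step = 0.4 if diagnosis.stage == "early" else 0.1
    return [{"proximity_bonus": center + step}, {"proximity_bonus": center - step}]


def toy_validate(config: ShapingConfig) -> float:
    """Stand-in for a short-horizon branch run scored by sparse return.
    The peak at 0.5 is arbitrary and exists only for illustration."""
    return 1.0 - abs(config["proximity_bonus"] - 0.5)


def adapt_step(
    current: ShapingConfig,
    sparse_return: float,
    evidence: Dict[str, float],
    critic: Callable[[float, Dict[str, float]], Diagnosis],
    generator: Callable[[Diagnosis, Feedback], List[ShapingConfig]],
    validate: Callable[[ShapingConfig], float],
    rounds: int = 3,
) -> ShapingConfig:
    """One adaptation step: diagnose, propose, validate on branches, and
    promote a candidate only if it beats the current configuration."""
    diagnosis = critic(sparse_return, evidence)
    feedback: Feedback = [(current, validate(current))]  # baseline branch
    for _ in range(rounds):
        # Branch-validation scores are fed back to refine later proposals.
        for candidate in generator(diagnosis, feedback):
            feedback.append((candidate, validate(candidate)))
    best_config, best_score = max(feedback, key=lambda item: item[1])
    return best_config if best_score > feedback[0][1] else current


current = {"proximity_bonus": 0.0}
promoted = adapt_step(
    current, 0.05, {"mean_agent_distance": 2.0},
    toy_critic, toy_generator, toy_validate,
)
print(promoted)
```

The point of the sketch is the gating: candidates are scored on separate branches first, and the main training configuration changes only when a candidate outperforms the current one under validation. If the critic finds no failure, or no candidate improves on the baseline, the existing configuration is kept.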

Problem

Research questions and friction points this paper is trying to address.

multi-agent reinforcement learning

sparse-reward

learning-signal optimization

coordination

training dynamics

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-driven

adaptive reward shaping

multi-agent reinforcement learning