🤖 AI Summary
Credit assignment, the problem of quantifying each agent's individual contribution in a multi-agent system, remains a fundamental challenge. Method: This paper decomposes credit assignment into two pattern-recognition subtasks, sequential improvement identification and causal attribution, and proposes the first centralized reward-critic framework powered by large language models (LLMs). It introduces two novel paradigms, LLM-MCA (LLM-based Multi-Agent Credit Assignment) and LLM-TACA (LLM-based Task-Aware Explicit Credit Allocation), which enable fine-grained, interpretable assessment of individual contributions and direct propagation of task-level objectives. It also constructs the first large-scale cooperative-trajectory dataset with step-level individual reward annotations. Results: Evaluated on Level-Based Foraging, Robotic Warehouse, and the newly introduced Spaceworld benchmark, which incorporates collision-avoidance safety constraints, both methods significantly outperform state-of-the-art baselines, accelerating policy convergence and improving team coordination.
📝 Abstract
Recent work, spanning autonomous vehicle coordination to in-space assembly, has shown the importance of learning collaborative behavior for enabling robots to achieve shared goals. A common approach to learning this cooperative behavior is the centralized-training decentralized-execution paradigm. However, this approach introduces a new challenge: how do we evaluate the contribution of each agent's actions to the overall success or failure of the team? This credit assignment problem remains open and has been studied extensively in the Multi-Agent Reinforcement Learning literature. In fact, humans manually inspecting agent behavior often produce better credit evaluations than existing methods. We combine this observation with recent work showing that Large Language Models achieve human-level performance on many pattern recognition tasks. Our key idea is to reformulate credit assignment as two pattern recognition problems, sequence improvement and attribution, which motivates our novel LLM-MCA method. Our approach uses a centralized LLM reward-critic that numerically decomposes the environment reward according to each agent's individual contribution to the scenario. We then update the agents' policy networks based on this feedback. We also propose an extension, LLM-TACA, in which our LLM critic performs explicit task assignment by passing an intermediary goal directly to each agent's policy. Both methods far outperform the state-of-the-art on a variety of benchmarks, including Level-Based Foraging, Robotic Warehouse, and our new Spaceworld benchmark, which incorporates collision-related safety constraints. As an artifact of our methods, we generate large trajectory datasets in which every timestep is annotated with per-agent reward information sampled from our LLM critics.
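The core mechanism, a centralized critic that splits a shared team reward into per-agent credit, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `llm_critic` is a hypothetical stand-in for an actual LLM query, and the prompt format and rescaling step are assumptions for the sake of a runnable example.

```python
from typing import List

def llm_critic(prompt: str) -> str:
    # Hypothetical stub standing in for a real LLM call. The paper's
    # centralized critic would be queried with the full trajectory
    # context; here we return a fixed credit split for illustration.
    return "0.7, 0.3"

def decompose_reward(team_reward: float, observations: List[str]) -> List[float]:
    """Ask the centralized critic for per-agent credit fractions,
    then rescale so the individual rewards sum to the team reward."""
    prompt = (
        "Given these agent observations and actions, assign each agent "
        "a fraction of credit for the team reward:\n" + "\n".join(observations)
    )
    weights = [float(w) for w in llm_critic(prompt).split(",")]
    total = sum(weights)
    # Normalized decomposition: per-agent rewards sum to team_reward.
    return [team_reward * w / total for w in weights]

per_agent = decompose_reward(10.0, ["agent0: picked up fruit", "agent1: moved north"])
print(per_agent)
```

Each agent's policy network would then be trained on its own entry of `per_agent` instead of the raw shared reward; LLM-TACA additionally has the critic emit an explicit intermediary goal per agent.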