$TAR^2$: Temporal-Agent Reward Redistribution for Optimal Policy Preservation in Multi-Agent Reinforcement Learning

📅 2025-02-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the credit assignment challenge arising from sparse and delayed global rewards in cooperative multi-agent reinforcement learning, this paper proposes a fine-grained framework that jointly redistributes the global reward across both the agent and timestep dimensions, achieving, for the first time, theoretically guaranteed policy invariance and unbiased policy gradients. Its core techniques integrate potential-based reward shaping, counterfactual credit assignment, temporal decomposition modeling, and a policy-gradient consistency analysis. Evaluated on the SMACLite and Google Research Football benchmarks, the approach significantly accelerates convergence (42% average speedup) and improves final performance (win rate +15.3%), consistently outperforming state-of-the-art baselines including AREL and STAS. The framework offers both interpretability and theoretical convergence guarantees.

📝 Abstract
In cooperative multi-agent reinforcement learning (MARL), learning effective policies is challenging when global rewards are sparse and delayed. This difficulty arises from the need to assign credit across both agents and time steps, a problem that existing methods often fail to address in episodic, long-horizon tasks. We propose Temporal-Agent Reward Redistribution ($TAR^2$), a novel approach that decomposes sparse global rewards into agent-specific, time-step-specific components, thereby providing more frequent and accurate feedback for policy learning. Theoretically, we show that $TAR^2$ (i) aligns with potential-based reward shaping, preserving the same optimal policies as the original environment, and (ii) maintains policy gradient update directions identical to those under the original sparse reward, ensuring unbiased credit signals. Empirical results on two challenging benchmarks, SMACLite and Google Research Football, demonstrate that $TAR^2$ significantly stabilizes and accelerates convergence, outperforming strong baselines like AREL and STAS in both learning speed and final performance. These findings establish $TAR^2$ as a principled and practical solution for agent-temporal credit assignment in sparse-reward multi-agent systems.
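As an illustration of the decomposition the abstract describes (not the paper's actual learned model), a minimal sketch: a sparse episode return is split into per-timestep, per-agent credits via softmax-normalized relevance scores, so the redistributed rewards sum exactly back to the global return. The `scores` array stands in for whatever credit model produces them and is purely hypothetical here.

```python
import numpy as np

def redistribute_reward(global_return, scores):
    """Split a sparse episode return across timesteps and agents.

    scores: array of shape (T, N) of hypothetical relevance scores
    for each timestep t and agent i. Softmax normalization guarantees
    the redistributed credits sum exactly to global_return, so no
    reward mass is created or destroyed by the redistribution.
    """
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return global_return * weights  # dense credits, shape (T, N)

# Example: 3 timesteps, 2 agents, sparse episode return of 10.0
rng = np.random.default_rng(0)
credits = redistribute_reward(10.0, rng.normal(size=(3, 2)))
assert np.isclose(credits.sum(), 10.0)  # sum-preservation constraint
```

Each agent then trains on its own dense column of `credits` instead of the single delayed global reward, which is the "more frequent and accurate feedback" the abstract refers to.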
Problem

Research questions and friction points this paper is trying to address.

Addresses sparse, delayed global rewards in MARL
Proposes Temporal-Agent Reward Redistribution
Ensures unbiased credit signals for agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes sparse global rewards
Aligns with potential-based reward shaping
Maintains unbiased policy gradient updates
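The "aligns with potential-based reward shaping" point rests on the classic policy-invariance result: adding $F(s, s') = \gamma\,\Phi(s') - \Phi(s)$ to any reward leaves the set of optimal policies unchanged. A toy check of that property on a hypothetical 4-state chain MDP (not from the paper), using Q-value iteration with and without a shaping potential:

```python
import numpy as np

# 4-state chain: action 0 moves left, action 1 moves right.
# Only reaching the last state yields reward 1 (sparse reward).
N_STATES, GAMMA = 4, 0.9

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return s2, float(s2 == N_STATES - 1)

def q_iteration(shaping=None, iters=200):
    Q = np.zeros((N_STATES, 2))
    for _ in range(iters):
        for s in range(N_STATES):
            for a in range(2):
                s2, r = step(s, a)
                if shaping is not None:
                    # potential-based shaping term F(s, s')
                    r += GAMMA * shaping(s2) - shaping(s)
                Q[s, a] = r + GAMMA * Q[s2].max()
    return Q

phi = lambda s: float(s)  # hypothetical potential function
greedy_plain = q_iteration().argmax(axis=1)
greedy_shaped = q_iteration(shaping=phi).argmax(axis=1)
assert (greedy_plain == greedy_shaped).all()  # same greedy policy
```

The Q-values themselves differ under shaping, but the greedy policy does not; $TAR^2$'s invariance guarantee is the analogous claim for its agent-temporal redistribution.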