Adaptive Robust Estimator for Multi-Agent Reinforcement Learning

📅 2026-03-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses credit assignment and unstable policy optimization in multi-agent collaborative reasoning, challenges that are exacerbated by interaction ambiguity and heavy-tailed reward noise. The authors propose a robust multi-agent reinforcement learning framework that integrates a three-stage collaborative reasoning pipeline (answer, critique, and rewrite phases) with an adaptive robust estimator. This design supports clear credit assignment and improves robustness to reward noise during advantage estimation and policy optimization. Experiments show that the method consistently outperforms existing baselines in both homogeneous and heterogeneous settings on mathematical reasoning and embodied intelligence benchmarks, with markedly improved training stability.

📝 Abstract
Multi-agent collaboration has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models, yet it suffers from interaction-level ambiguity that blurs the boundaries between generation, critique, and revision, making credit assignment across agents difficult. Moreover, policy optimization in this setting is vulnerable to heavy-tailed and noisy rewards, which can bias advantage estimation and trigger unstable or even divergent training. To address both issues, we propose a robust multi-agent reinforcement learning framework for collaborative reasoning, consisting of two components: Dual-Agent Answer-Critique-Rewrite (DACR) and an Adaptive Robust Estimator (ARE). DACR decomposes reasoning into a structured three-stage pipeline of answer, critique, and rewrite, enabling explicit attribution of each agent's marginal contribution to its partner's performance. ARE provides robust estimation of batch experience means during multi-agent policy optimization. Across mathematical reasoning and embodied intelligence benchmarks, even under noisy rewards, our method consistently outperforms the baseline in both homogeneous and heterogeneous settings. These results indicate stronger robustness to reward noise and more stable training dynamics, effectively preventing optimization failures caused by noisy reward signals.
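The abstract does not spell out ARE's exact form, only that it robustly estimates batch experience means under heavy-tailed reward noise. As one illustrative sketch of that idea (not the paper's actual estimator), a median-of-means statistic can replace the raw batch mean when centering rewards into advantages; the function names and the block count below are assumptions for illustration only:

```python
import numpy as np

def median_of_means(rewards, n_blocks=5):
    """Robust estimate of a batch's mean reward.

    Splits the batch into blocks, averages each block, and takes the
    median of the block means. A single heavy-tailed outlier corrupts
    at most one block, so it cannot drag the estimate arbitrarily far,
    unlike the plain sample mean.
    """
    rewards = np.asarray(rewards, dtype=float)
    n_blocks = min(n_blocks, len(rewards))
    blocks = np.array_split(rewards, n_blocks)
    return float(np.median([b.mean() for b in blocks]))

def robust_advantages(rewards, n_blocks=5):
    """Center batch rewards on the robust mean instead of the raw mean."""
    rewards = np.asarray(rewards, dtype=float)
    return rewards - median_of_means(rewards, n_blocks)

# Example: one corrupted reward in an otherwise well-behaved batch.
rewards = [1.0, 1.2, 0.9, 1.1, 1.0, 1.05, 0.95, 1.0, 100.0, 1.0]
print(np.mean(rewards))            # raw mean is pulled above 10 by the outlier
print(median_of_means(rewards))    # robust mean stays near the typical reward
```

An adaptive variant in the spirit of ARE might tune the block count (or a trimming threshold) from the observed tail behavior of each batch, but absent those details from the abstract this sketch fixes it as a hyperparameter.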
Problem

Research questions and friction points this paper is trying to address.

multi-agent reinforcement learning
credit assignment
reward noise
training instability
collaborative reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-Agent Reinforcement Learning
Robust Estimation
Collaborative Reasoning
Credit Assignment
Noise-Resilient Optimization