🤖 AI Summary
This work addresses the challenge of credit assignment in multi-agent large language model (LLM) collaboration, where shared global rewards often lead to ambiguous credit attribution, high-variance policy updates, and free-riding behavior. To mitigate these issues, the authors propose a role-aware credit assignment mechanism that estimates each agent's marginal contribution through counterfactual trajectory analysis. This approach constructs a role-sensitive advantage function and incorporates a dynamic counterfactual baseline along with a global history-aware normalization scheme. The proposed method, presented as the first fine-grained and stable credit assignment scheme for multi-agent LLM collaboration, effectively suppresses free-riding. Experimental results demonstrate significant performance gains over existing approaches on mathematical and logical reasoning tasks, with consistently more efficient and stable training observed in both sequential collaboration and multi-agent voting scenarios.
📝 Abstract
Collaborative multi-agent large language models (LLMs) can solve complex reasoning tasks by decomposing roles and aggregating diverse hypotheses. Yet, reinforcement learning (RL) for such systems is often undermined by credit assignment: a shared global reward obscures individual contributions, inflating update variance and encouraging free-riding. We introduce Counterfactual Credit Policy Optimization (CCPO), a framework that assigns agent-specific learning signals by estimating each agent's marginal contribution through counterfactual trajectories. CCPO builds dynamic counterfactual baselines that simulate outcomes with an agent's contribution removed, yielding role-sensitive advantages for policy optimization. To further improve stability under heterogeneous tasks and data distributions, we propose a global-history-aware normalization scheme that calibrates advantages using global rollout statistics. We evaluate CCPO on two collaboration topologies: a sequential Think–Reason dyad and multi-agent voting. Across mathematical and logical reasoning benchmarks, CCPO mitigates free-riding and outperforms strong multi-agent RL baselines, yielding finer-grained and more effective credit assignment for collaborative LLM training. Our code is available at https://github.com/bhai114/ccpo.
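The core idea described above, a counterfactual baseline subtracted from the shared reward and then calibrated with global rollout statistics, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name, the array shapes, and the running-statistics interface (`history_mean`, `history_std`) are all assumptions for the sake of the example.

```python
import numpy as np

def counterfactual_advantages(rewards, cf_rewards, history_mean=0.0, history_std=1.0, eps=1e-8):
    """Sketch of a counterfactual, history-normalized advantage.

    rewards:    shape (num_rollouts,), shared global reward per rollout.
    cf_rewards: shape (num_rollouts, num_agents), hypothetical reward of each
                rollout with one agent's contribution removed (the
                counterfactual baseline; how it is simulated is method-specific).
    history_mean / history_std: running statistics accumulated over past
                rollouts, standing in for the global-history-aware calibration.
    """
    rewards = np.asarray(rewards, dtype=float)
    cf_rewards = np.asarray(cf_rewards, dtype=float)
    # Marginal contribution: actual outcome minus the counterfactual baseline.
    marginal = rewards[:, None] - cf_rewards
    # Normalize with global (historical) statistics rather than the current
    # batch alone, to stabilize updates across heterogeneous tasks.
    return (marginal - history_mean) / (history_std + eps)
```

An agent whose removal barely changes the outcome (counterfactual reward close to the actual reward) receives a near-zero advantage, which is what suppresses free-riding in this formulation.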