TRACER: Turn-level Regret Matching with Inner Reinforcement Credit for Cooperative Multi-LLM Reasoning

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches to combining reinforcement learning with multi-agent prompting for collaborative large language model reasoning face sparse rewards, free-riding among roles, high training overhead, and fixed communication protocols that settle into oscillating local optima. This work proposes TRACER, a framework that splits collaborative decision-making into two layers: a controller-regret layer, which uses turn-level regret matching to decide whether an agent should speak or skip in the current round, and a generation-credit layer, which optimizes proposer and reviewer utterances with role-specific GSPO rewards. Because each controller chooses between only two actions, the authors argue that classical game-theoretic convergence results for finite action spaces carry over to the deep learning setting, while credit is assigned at both the action-mode level and the utterance level. All methods are trained on the GSM8K training split and evaluated on held-out GSM8K, MATH500, and GPQA-Diamond, where TRACER is reported to improve in-domain accuracy and cross-benchmark generalization, reduce inference cost, and preserve self-correction behavior.

📝 Abstract

Large language models increasingly rely on either reinforcement learning or multi-agent prompting to improve reasoning, yet these two paradigms remain difficult to combine. Directly applying single-agent reinforcement learning to multi-turn multi-agent systems faces the following dilemmas: i) rewards are sparse, roles free-ride on one another, and training overhead is excessive; ii) agents merely imitate collaboration instead of learning it; iii) a fixed collaboration protocol falls into an oscillating local optimum. We introduce TRACER, a turn-level reinforcement framework for cooperative multi-LLM reasoning. TRACER separates collaborative decision making into a controller-regret layer, where controllers learn through regret matching whether the agents should speak or skip the current round, and a generation-credit layer, which optimizes proposer and reviewer utterances with role-specific GSPO rewards. This design addresses each dilemma in turn. i) It assigns credit at the level of both action modes and generated utterances, thus avoiding free-riding and sparse rewards, and because only the choices made by the controllers are expanded, the computational cost of training is greatly reduced. ii) Agents acquire collaborative capability as they learn when to speak and what to say. iii) By restricting the controllers to binary actions, we extend classical game theory established for finite action spaces to deep learning, thus achieving mathematically rigorous convergence. We train all local RL-style methods on the GSM8K training split and evaluate on held-out GSM8K, MATH500, and GPQA-Diamond to measure in-domain accuracy, cross-benchmark generalization, inference cost, and correction-preservation behavior. The resulting framework provides a compact and reproducible testbed for studying learned collaboration policies beyond fixed debate, voting, or aggregation protocols. Code is available at https://github.com/Shark-Forest/TRACER.
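
To make the controller-regret idea concrete, the following is a minimal sketch of textbook regret matching over the binary speak/skip action set. It is not taken from the TRACER repository: the class name RegretMatchingController, the method names, and the assumption that a utility estimate is available for both actions in each round are all hypothetical and chosen only for illustration. The controller keeps a cumulative regret per action and plays each action with probability proportional to its positive regret, falling back to a uniform choice when no regret is positive.

```python
import random

# Binary action set described in the abstract.
ACTIONS = ("speak", "skip")


class RegretMatchingController:
    """Hypothetical controller; an illustration, not the authors' code."""

    def __init__(self):
        # Cumulative regret for not having played each action.
        self.cum_regret = {a: 0.0 for a in ACTIONS}

    def strategy(self):
        """Play each action in proportion to its positive cumulative regret."""
        positive = {a: max(r, 0.0) for a, r in self.cum_regret.items()}
        total = sum(positive.values())
        if total <= 0.0:
            # No positive regret yet: fall back to a uniform choice.
            return {a: 1.0 / len(ACTIONS) for a in ACTIONS}
        return {a: p / total for a, p in positive.items()}

    def sample(self, rng=random):
        """Draw one action from the current strategy."""
        probs = self.strategy()
        return rng.choices(ACTIONS, weights=[probs[a] for a in ACTIONS])[0]

    def update(self, utilities):
        """Accumulate regret from one round.

        `utilities` maps each action to its estimated utility for this
        round (assumed to be available for both actions).
        """
        probs = self.strategy()
        expected = sum(probs[a] * utilities[a] for a in ACTIONS)
        for a in ACTIONS:
            self.cum_regret[a] += utilities[a] - expected


controller = RegretMatchingController()
controller.update({"speak": 1.0, "skip": 0.0})  # speaking would have helped
print(controller.strategy())  # {'speak': 1.0, 'skip': 0.0}
```

With only two actions the strategy reduces to a single probability of speaking, which is the finite action space that classical regret-matching convergence results assume. How TRACER estimates the per-action utilities and couples them to the GSPO rewards is not specified in the abstract and is left out of this sketch.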

Problem

Research questions and friction points this paper is trying to address.

cooperative multi-LLM reasoning

reinforcement learning

multi-agent prompting

sparse rewards

collaboration protocol

Innovation

Methods, ideas, or system contributions that make the work stand out.

turn-level reinforcement learning

regret matching

credit assignment