Who Gets the Reward, Who Gets the Blame? Evaluation-Aligned Training Signals for Multi-LLM Agents

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
In multi-agent LLM systems, collaborative training lacks a theoretical foundation for decomposing global task evaluation into agent-level and message-level supervision signals. Method: We propose a unified framework that integrates cooperative game-theoretic attribution, based on bounded, signed, and conserved Shapley-value credit assignment, with process reward modeling (PRM). This enables an auditable mapping from system-level evaluation to local learning: agents are rewarded fairly in successful trajectories, while in failures the first erroneous step is precisely identified and paired with repairable corrective signals. Contribution/Results: The framework is compatible with both preference learning and reinforcement learning, yielding interpretable and verifiable training signals. Its core innovation is establishing the theoretical consistency and practical feasibility of global-to-local credit assignment, thereby providing the first principled, scalable signal-generation paradigm for optimizing collaborative multi-LLM agent systems.
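
The summary does not spell out the attribution computation, so here is a minimal sketch of exact Shapley-value credit assignment over agents. The agent roles and the coalition evaluator `evaluate_coalition` are illustrative assumptions, not the paper's implementation; in practice the evaluator would score the task outcome when only a subset of agents participates.

```python
from itertools import combinations
from math import factorial

def shapley_credits(agents, evaluate_coalition):
    """Exact Shapley values: each agent's average marginal contribution
    to the coalition score over all orderings. Credits are signed (an
    agent can hurt the outcome) and conserved: they sum to
    evaluate_coalition(all agents) - evaluate_coalition(empty set)."""
    n = len(agents)
    credits = {a: 0.0 for a in agents}
    for agent in agents:
        others = [a for a in agents if a != agent]
        for k in range(n):
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                marginal = (evaluate_coalition(set(coalition) | {agent})
                            - evaluate_coalition(set(coalition)))
                credits[agent] += weight * marginal
    return credits

# Hypothetical coalition evaluator: returns a scalar task score in [0, 1]
# for the trajectory produced by the given subset of agents.
def evaluate_coalition(coalition):
    scores = {frozenset(): 0.0,
              frozenset({"planner"}): 0.2,
              frozenset({"coder"}): 0.3,
              frozenset({"planner", "coder"}): 0.9}
    return scores[frozenset(coalition)]

print(shapley_credits(["planner", "coder"], evaluate_coalition))
# {'planner': 0.4, 'coder': 0.5} -- sums to 0.9, the full-team score.
```

Because marginal contributions can be negative and the credits sum exactly to the grand-coalition score, the resulting signals are signed and conserved, matching the properties the summary claims.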

📝 Abstract
Large Language Models (LLMs) in multi-agent systems (MAS) have shown promise for complex tasks, yet current training methods lack principled ways to connect system-level evaluation with agent-level and message-level learning. We propose a theoretical framework that unifies cooperative game-theoretic attribution with process reward modeling to transform system evaluation into agent credit and then into response-level signals. Unlike prior approaches that rely only on attribution (e.g., Shapley) or step-level labels (e.g., PRM), our method produces local, signed, and credit-conserving signals. In success cases, Shapley-based credit assignment fairly allocates outcomes across agents and is refined into per-message rewards that promote cooperation while discouraging redundancy or sabotage. In failure cases, first-error localization yields repair-aware preferences that penalize harmful steps while rewarding corrective attempts. The resulting signals are bounded, cooperative, and directly compatible with reinforcement-based or preference-based post-training, providing a unified and auditable pathway from global evaluation to local supervision in LLM multi-agent training. Our contribution is conceptual: we present a theoretical foundation and training signals, leaving empirical validation for future work.
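
To make the failure branch concrete, here is a minimal sketch of first-error localization and the repair-aware preference pair it induces; `step_verifier`, the message-list format, and `repaired_step` are hypothetical stand-ins for the paper's process reward model and corrective attempt, not its actual interface.

```python
def first_error_index(trajectory, step_verifier):
    """Scan a failed trajectory in order and return the index of the
    first message a process-level verifier judges erroneous, or None."""
    for i, message in enumerate(trajectory):
        if not step_verifier(message):
            return i
    return None

def repair_aware_preferences(trajectory, repaired_step, step_verifier):
    """Build a preference pair at the first erroneous step: the
    corrective rewrite is preferred over the harmful original, so later
    (possibly sound) steps are not blamed for an upstream mistake."""
    i = first_error_index(trajectory, step_verifier)
    if i is None:
        return []
    return [{"context": trajectory[:i],
             "preferred": repaired_step,   # corrective attempt, rewarded
             "rejected": trajectory[i]}]   # first harmful step, penalized
```

Localizing blame to the first erroneous step, rather than penalizing the whole failed trajectory, is what keeps the failure-case signal local and repair-aware in the sense the abstract describes.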
Problem

Research questions and friction points this paper is trying to address.

Linking system-level evaluation to agent-level learning in multi-agent systems
Transforming global evaluation into local training signals for LLMs
Providing auditable credit assignment from outcomes to individual messages
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unifies game-theoretic attribution with process reward modeling
Produces local, signed, credit-conserving training signals
Transforms system evaluation into agent-level and response-level supervision (see the sketch below)
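
Continuing the Shapley sketch above, one plausible reading of the last point is distributing each agent's signed credit over its own messages with PRM-derived weights, so that message-level rewards conserve the agent-level credit exactly; the proportional weighting here is an assumption for illustration, not the authors' scheme.

```python
def per_message_rewards(agent_credit, prm_scores):
    """Distribute an agent's (signed) Shapley credit over its messages
    in proportion to non-negative process-reward-model scores. The
    per-message rewards sum back to agent_credit, so credit is
    conserved at every level of the decomposition."""
    total = sum(prm_scores)
    if total == 0:
        # No PRM preference among messages: split the credit evenly.
        return [agent_credit / len(prm_scores)] * len(prm_scores)
    return [agent_credit * s / total for s in prm_scores]

# An agent credited +0.4 whose two messages scored 0.1 and 0.3 by the PRM:
print(per_message_rewards(0.4, [0.1, 0.3]))  # [0.1, 0.3], summing to 0.4
```

Negative agent credit flows through unchanged, yielding negative per-message rewards, which is how a sabotaging agent's messages would be penalized while the system-wide credit budget stays balanced.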