🤖 AI Summary
This study addresses credit assignment in multi-turn social dialogues: how to fairly attribute an overall interaction outcome to the individual utterances that produced it. The authors propose SAVOIR, a framework that integrates cooperative game theory with reinforcement learning. It combines Shapley values with forward-looking expected utility to produce reward attributions that are interpretable and axiomatically grounded (satisfying efficiency, symmetry, and marginality), overcoming the limitations of purely retrospective evaluation. A 7B-parameter language model fine-tuned with this framework achieves new state-of-the-art performance on the SOTOPIA benchmark, matching or surpassing GPT-4o and Claude-3.5-Sonnet. Large reasoning models consistently underperform on the same benchmark, which the authors take as evidence that social intelligence relies on capabilities distinct from analytical reasoning.
📝 Abstract
Social intelligence, the ability to navigate complex interpersonal interactions, presents a fundamental challenge for language agents. Training such agents via reinforcement learning requires solving the credit assignment problem: determining how individual utterances contribute to multi-turn dialogue outcomes. Existing approaches directly employ language models to distribute episode-level rewards, yielding attributions that are retrospective and lack theoretical grounding. We propose SAVOIR (ShApley Value fOr SocIal RL), a novel principled framework grounded in cooperative game theory. Our approach combines two complementary principles: expected utility shifts evaluation from retrospective attribution to prospective valuation, capturing an utterance's strategic potential for enabling favorable future trajectories; Shapley values ensure fair credit distribution with axiomatic guarantees of efficiency, symmetry, and marginality. Experiments on the SOTOPIA benchmark demonstrate that SAVOIR achieves new state-of-the-art performance across all evaluation settings, with our 7B model matching or exceeding proprietary models including GPT-4o and Claude-3.5-Sonnet. Notably, even large reasoning models consistently underperform, suggesting social intelligence requires qualitatively different capabilities than analytical reasoning.
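To make the Shapley-value idea concrete, the following is a minimal sketch of exact Shapley attribution over three utterances. It is not the SAVOIR implementation. The utterance labels (`u1`, `u2`, `u3`) and the function `toy_expected_utility` are hypothetical stand-ins for whatever estimate of expected future utility a real system would use. The sketch only illustrates how marginal contributions are averaged over orderings and how the efficiency axiom (the credits sum to the total value) can be checked.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, Sequence


def shapley_values(
    players: Sequence[str],
    value: Callable[[FrozenSet[str]], float],
) -> Dict[str, float]:
    """Exact Shapley values by averaging marginal contributions over all orderings.

    Cost grows factorially with the number of players, so this is only
    practical for very small sets.
    """
    players = list(players)
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}


def toy_expected_utility(coalition: FrozenSet[str]) -> float:
    """Hypothetical value function (an assumption for illustration only).

    u1 and u2 each add 2.0 on their own; u3 adds 3.0 only when u1 is
    also present, modelling an utterance that pays off through a setup.
    """
    score = 0.0
    if "u1" in coalition:
        score += 2.0
    if "u2" in coalition:
        score += 2.0
    if "u1" in coalition and "u3" in coalition:
        score += 3.0
    return score


utterances = ["u1", "u2", "u3"]
phi = shapley_values(utterances, toy_expected_utility)
total = toy_expected_utility(frozenset(utterances))
print(phi)
print("sum of credits:", sum(phi.values()), "total value:", total)
```

In this toy setup the 3.0 synergy between `u1` and `u3` is split evenly between them, so `u1` receives more credit than `u2` even though both contribute 2.0 on their own. For realistic dialogue lengths, exact enumeration is infeasible and some form of sampling over orderings would be needed; how SAVOIR handles this is not stated in the abstract.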