Sotopia-RL: Reward Design for Social Intelligence

📅 2025-08-05
🤖 AI Summary
Problem: Standard MDP-based reinforcement learning struggles with credit assignment and training instability when modeling social interaction, due to partial observability (utterances have delayed, indirect effects) and multi-dimensionality (implicit goals such as rapport building and knowledge acquisition), especially under single-dimensional, episode-level rewards. Method: Sotopia-RL is a novel RL framework that decomposes coarse, episode-level human feedback into fine-grained, utterance-level, multi-dimensional reward signals, coupled with a fine-grained credit-assignment mechanism, enabling efficient and stable training of LLM social intelligence in the open-ended social simulation environment Sotopia. Contribution/Results: On the Sotopia-hard and Sotopia-full benchmarks, Sotopia-RL achieves social goal completion scores of 7.17 and 8.31, respectively, significantly surpassing prior methods and setting a new state of the art.

📝 Abstract
Social intelligence has become a critical capability for large language models (LLMs), enabling them to engage effectively in real-world social tasks such as accommodation, persuasion, collaboration, and negotiation. Reinforcement learning (RL) is a natural fit for training socially intelligent agents because it allows models to learn sophisticated strategies directly through social interactions. However, social interactions have two key characteristics that set barriers for RL training: (1) partial observability, where utterances have indirect and delayed effects that complicate credit assignment, and (2) multi-dimensionality, where behaviors such as rapport-building or knowledge-seeking contribute indirectly to goal achievement. These characteristics make Markov decision process (MDP)-based RL with single-dimensional episode-level rewards inefficient and unstable. To address these challenges, we propose Sotopia-RL, a novel framework that refines coarse episode-level feedback into utterance-level, multi-dimensional rewards. Utterance-level credit assignment mitigates partial observability by attributing outcomes to individual utterances, while multi-dimensional rewards capture the full richness of social interactions and reduce reward hacking. Experiments in Sotopia, an open-ended social learning environment, demonstrate that Sotopia-RL achieves state-of-the-art social goal completion scores (7.17 on Sotopia-hard and 8.31 on Sotopia-full), significantly outperforming existing approaches. Ablation studies confirm the necessity of both utterance-level credit assignment and multi-dimensional reward design for RL training. Our implementation is publicly available at: https://github.com/sotopia-lab/sotopia-rl.
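The reward design described in the abstract can be illustrated with a minimal sketch. The dimension names (`goal_progress`, `rapport`, `knowledge`), the fixed weights, and the discounting scheme below are all illustrative assumptions, not the paper's actual reward model: the point is only to show how multi-dimensional, utterance-level scores collapse into per-step training signals instead of one episode-level reward.

```python
from dataclasses import dataclass

# Hypothetical reward dimensions per utterance; the paper's actual
# dimensions and weighting scheme may differ.
@dataclass
class UtteranceReward:
    goal_progress: float   # direct contribution to the social goal
    rapport: float         # relationship-building value
    knowledge: float       # information gained from the exchange

def combine(r: UtteranceReward, weights=(0.6, 0.2, 0.2)) -> float:
    """Collapse multi-dimensional scores into one scalar training signal."""
    return (weights[0] * r.goal_progress
            + weights[1] * r.rapport
            + weights[2] * r.knowledge)

def discounted_returns(rewards, gamma=0.95):
    """Turn per-utterance scalars into returns for policy-gradient training."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

episode = [
    UtteranceReward(0.2, 0.8, 0.1),   # opening small talk: mostly rapport
    UtteranceReward(0.5, 0.3, 0.6),   # probing question: knowledge-seeking
    UtteranceReward(0.9, 0.1, 0.2),   # the ask: direct goal progress
]
scalars = [combine(r) for r in episode]
returns = discounted_returns(scalars)
```

Because each utterance gets its own reward, early moves like small talk receive credit directly rather than only through a delayed episode score.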
Problem

Research questions and friction points this paper is trying to address.

Designing rewards for social intelligence in LLMs
Addressing partial observability in social interactions
Handling multi-dimensionality in social behavior rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utterance-level credit assignment for partial observability
Multi-dimensional rewards for social interactions
Refines episode-level feedback into utterance-level rewards
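The third point, refining episode-level feedback into utterance-level rewards, can be sketched as redistributing a single episode score over utterances in proportion to judged contribution. This is a simplification under assumed inputs (the attribution weights are hypothetical, e.g. from an LLM judge); Sotopia-RL's actual mechanism may differ.

```python
def redistribute(episode_score, attributions):
    """Split one episode-level score into per-utterance rewards,
    proportional to hypothetical per-utterance attribution weights."""
    total = sum(attributions)
    if total == 0:
        # No attribution signal: spread the score uniformly.
        n = len(attributions)
        return [episode_score / n] * n
    return [episode_score * a / total for a in attributions]

# An episode judged 8.0 overall; a judge rated each utterance's contribution.
per_turn = redistribute(8.0, [1.0, 3.0, 4.0])
```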