Small LLMs Do Not Learn a Generalizable Theory of Mind via Reinforcement Learning

📅 2025-07-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study investigates whether small-scale large language models (LLMs) can acquire a robust, generalizable theory of mind (ToM) capability through reinforcement learning with verifiable rewards (RLVR). Models are trained on combinations of ToM datasets (HiToM, ExploreToM, and FANToM) and evaluated on held-out, out-of-distribution (OOD) datasets such as OpenToM to assess cross-task transfer. Performance improves on in-distribution tasks but does not transfer: on OOD tasks it stays flat or degrades, and prolonged training leads models to exploit statistical patterns of the training datasets rather than acquire an abstract ToM capability. The work offers systematic empirical evidence of this limitation of RLVR for small LLMs and urges caution against reading in-domain gains as evidence of social intelligence.

📝 Abstract

Recent advancements in large language models (LLMs) have demonstrated emergent capabilities in complex reasoning, largely spurred by rule-based Reinforcement Learning (RL) techniques applied during post-training. This has raised the question of whether similar methods can instill more nuanced, human-like social intelligence, such as a Theory of Mind (ToM), in LLMs. This paper investigates whether small-scale LLMs can acquire a robust and generalizable ToM capability through RL with verifiable rewards (RLVR). We conduct a systematic evaluation by training models on various combinations of prominent ToM datasets (HiToM, ExploreToM, FANToM) and testing for generalization on held-out datasets (e.g., OpenToM). Our findings indicate that small LLMs struggle to develop a generic ToM capability. While performance on in-distribution tasks improves, this capability fails to transfer to unseen ToM tasks with different characteristics. Furthermore, we demonstrate that prolonged RL training leads to models "hacking" the statistical patterns of the training datasets, resulting in significant performance gains on in-domain data but no change, or even a degradation, in performance on out-of-distribution tasks. This suggests the learned behavior is a form of narrow overfitting rather than the acquisition of a true, abstract ToM capability.

Problem

Research questions and friction points this paper is trying to address.

Investigates if small LLMs develop generalizable Theory of Mind via RL

Tests transfer of ToM learning to unseen tasks with varied characteristics

Reveals RL training causes narrow overfitting, not abstract ToM capability

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Reinforcement Learning with verifiable rewards (RLVR)

Trains on combinations of multiple ToM datasets (HiToM, ExploreToM, FANToM)

Tests generalization on held-out ToM datasets
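
To make the setup above concrete, here is a minimal sketch of what a verifiable reward and the train/held-out split could look like. The abstract does not specify the reward format or the exact dataset mixtures, so the `<answer>` tag convention, the exact-match rule, and the helper names `extract_answer`, `verifiable_reward`, and `training_mixtures` are all assumptions made for illustration, not the authors' implementation.

```python
import re
from itertools import combinations
from typing import Iterator, Optional, Tuple

# Dataset names come from the abstract; how they are mixed is an assumption.
TRAIN_DATASETS = ["HiToM", "ExploreToM", "FANToM"]
HELD_OUT_DATASETS = ["OpenToM"]


def extract_answer(completion: str) -> Optional[str]:
    """Return the normalized text of the last <answer>...</answer> span.

    The tag format is hypothetical; the paper may use a different one.
    """
    matches = re.findall(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    return matches[-1].strip().lower() if matches else None


def verifiable_reward(completion: str, gold: str) -> float:
    """Rule-based reward: 1.0 on exact match with the gold answer, else 0.0."""
    predicted = extract_answer(completion)
    if predicted is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if predicted == gold.strip().lower() else 0.0


def training_mixtures(names) -> Iterator[Tuple[str, ...]]:
    """Enumerate every non-empty combination of training datasets."""
    for size in range(1, len(names) + 1):
        yield from combinations(names, size)


if __name__ == "__main__":
    sample = "Sally last saw the ball there. <answer>Basket</answer>"
    print(verifiable_reward(sample, "basket"))
    for mixture in training_mixtures(TRAIN_DATASETS):
        print("train on", mixture, "-> evaluate on", HELD_OUT_DATASETS)
```

A reward of this kind only checks the final answer string, which is one plausible reason a policy could score well by exploiting dataset-specific answer patterns without transferring to held-out tasks.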

🔎 Similar Papers

Entering Real Social World! Benchmarking the Social Intelligence of Large Language Models from a First-person Perspective

2024-10-08, Citations: 0
