🤖 AI Summary
This study investigates whether small-scale large language models (LLMs) can acquire robust, generalizable theory of mind (ToM) capabilities through reinforcement learning with verifiable rewards (RLVR). Models are trained on combinations of ToM datasets (HiToM, ExploreToM, and FANToM) and evaluated on held-out datasets such as OpenToM to assess cross-task transfer. Results show improvements on in-distribution tasks but no change, or even degradation, on out-of-distribution (OOD) tasks, which suggests that the models exploit statistical patterns in the training data rather than acquiring an abstract ToM capability. The work offers systematic empirical evidence of this limitation of RLVR for small-scale LLMs, and a methodological caution for evaluating social intelligence in language models: in-distribution gains alone do not demonstrate a general ToM capability.
📝 Abstract
Recent advances in large language models (LLMs) have demonstrated emergent capabilities in complex reasoning, largely spurred by rule-based Reinforcement Learning (RL) techniques applied during post-training. This has raised the question of whether similar methods can instill more nuanced, human-like social intelligence, such as a Theory of Mind (ToM), in LLMs. This paper investigates whether small-scale LLMs can acquire a robust and generalizable ToM capability through RL with verifiable rewards (RLVR). We conduct a systematic evaluation by training models on various combinations of prominent ToM datasets (HiToM, ExploreToM, FANToM) and testing for generalization on held-out datasets (e.g., OpenToM). Our findings indicate that small LLMs struggle to develop a generic ToM capability. While performance on in-distribution tasks improves, this capability fails to transfer to unseen ToM tasks with different characteristics. Furthermore, we demonstrate that prolonged RL training leads to models "hacking" the statistical patterns of the training datasets, resulting in significant performance gains on in-domain data but no change, or even degradation, in performance on out-of-distribution tasks. This suggests that the learned behavior is a form of narrow overfitting rather than the acquisition of a true, abstract ToM capability.