AI Summary
This work proposes SocialMindChange, a novel benchmark that shifts Theory of Mind (ToM) evaluation from passive observation to active intervention in social interactions. Built with a structured four-step framework, the benchmark provides a large-scale multi-agent dialogue dataset comprising 1,200 social contexts, 6,000 scenes, and over 90,000 questions, and requires models to generate goal-directed utterances across five sequential scenes involving four interlocutors. The task demands not only actively shaping others' mental states but also maintaining global consistency throughout extended dialogues. Evaluation of ten state-of-the-art large language models reveals a significant performance gap, with average scores 54.2% below human level, highlighting their current limitations in modulating higher-order mental states during long-horizon social interactions.
Abstract
Existing dynamic Theory of Mind (ToM) benchmarks mostly place language models in a passive role: the model reads a sequence of connected scenarios and reports what people believe, feel, intend, and do as these states change. In real social interaction, ToM is also used for action: a speaker plans what to say in order to shift another person's mental-state trajectory toward a goal. We introduce SocialMindChange, a benchmark that moves from tracking minds to changing minds in social interaction. Each instance defines a social context with four characters and five connected scenes. The model plays one character and generates dialogue across the five scenes to reach a target mental state while remaining consistent with the evolving states of all participants. SocialMindChange also includes questions about selected higher-order mental states. Using a structured four-step framework, we construct 1,200 social contexts, covering 6,000 scenes (five per context) and over 90,000 questions, each validated for realism and quality. Evaluations of ten state-of-the-art LLMs show that their average performance is 54.2% below human performance. This gap suggests that current LLMs still struggle to maintain and change mental-state representations across long, linked interactions.
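To make the task structure concrete, the following is a minimal sketch of how one benchmark instance and its scene-by-scene interaction loop might be represented. The schema (field names such as `context`, `target_state`, `scenes`) and the `model.generate` interface are illustrative assumptions, not the benchmark's released format.

```python
from dataclasses import dataclass

# Hypothetical schema for one SocialMindChange instance. Field names are
# illustrative guesses, not the benchmark's actual data format.
@dataclass
class Scene:
    description: str        # situational setup for this scene
    questions: list[str]    # mental-state questions, incl. higher-order ones

@dataclass
class Instance:
    context: str            # shared social context
    characters: list[str]   # the four interlocutors
    player: str             # the character the evaluated model controls
    target_state: str       # mental state the player tries to induce
    scenes: list[Scene]     # the five connected scenes

def run_instance(model, inst: Instance) -> list[str]:
    """Play one instance: the model speaks once per scene, conditioning
    on the full history so its utterances stay globally consistent."""
    history: list[str] = []
    utterances: list[str] = []
    for scene in inst.scenes:
        prompt = "\n".join([
            inst.context,
            *history,
            scene.description,
            f"As {inst.player}, say something that moves the others "
            f"toward: {inst.target_state}",
        ])
        reply = model.generate(prompt)  # assumed text-in/text-out interface
        history.append(f"{inst.player}: {reply}")
        utterances.append(reply)
    return utterances
```

Conditioning each turn on the full dialogue history mirrors the benchmark's requirement that the model remain globally consistent with the evolving states of all participants across the five scenes.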