RecToM: A Benchmark for Evaluating Machine Theory of Mind in LLM-based Conversational Recommender Systems

📅 2025-11-27

🤖 AI Summary

Existing Theory of Mind (ToM) benchmarks suffer from overreliance on static Sally-Anne–style narratives, neglect behavioral prediction capabilities, and fail to capture the dynamic complexity of real-world dialogue. To address this, we introduce the first machine ToM benchmark specifically designed for recommendation dialogues. Grounded in real user interaction data, it annotates latent mental states—such as intentions and beliefs—and constructs multi-turn, dynamic dialogue tasks evaluating both cognitive inference (intention recognition, belief updating) and behavioral prediction (strategy selection, decision consistency). Our contributions are threefold: (1) the first extension of ToM evaluation to authentic recommendation dialogue settings; (2) explicit modeling of how mental state reasoning guides subsequent conversational actions; and (3) the integration of behavioral prediction as a core evaluation dimension. Experiments reveal that while current large language models achieve moderate performance in mental state identification, they exhibit significant deficiencies in sustained intention tracking and strategy consistency.

📝 Abstract

Large language models (LLMs) are revolutionizing conversational recommender systems through their impressive capabilities in instruction comprehension, reasoning, and human interaction. A core factor underlying effective recommendation dialogue is the ability to infer and reason about users' mental states (such as desire, intention, and belief), a cognitive capacity commonly referred to as Theory of Mind (ToM). Despite growing interest in evaluating ToM in LLMs, current benchmarks predominantly rely on synthetic narratives inspired by the Sally-Anne test, which emphasize physical perception and fail to capture the complexity of mental state inference in realistic conversational settings. Moreover, existing benchmarks often overlook a critical component of human ToM: behavioral prediction, the ability to use inferred mental states to guide strategic decision-making and select appropriate conversational actions for future interactions. To better align LLM-based ToM evaluation with human-like social reasoning, we propose RecToM, a novel benchmark for evaluating ToM abilities in recommendation dialogues. RecToM focuses on two complementary dimensions: Cognitive Inference and Behavioral Prediction. The former focuses on understanding what has been communicated by inferring the underlying mental states. The latter emphasizes what should be done next, evaluating whether LLMs can leverage these inferred mental states to predict, select, and assess appropriate dialogue strategies. Extensive experiments on state-of-the-art LLMs demonstrate that RecToM poses a significant challenge. While the models exhibit partial competence in recognizing mental states, they struggle to maintain coherent, strategic ToM reasoning throughout dynamic recommendation dialogues, particularly in tracking evolving intentions and aligning conversational strategies with inferred mental states.

Problem

Research questions and friction points this paper is trying to address.

Evaluating Theory of Mind in LLM-based conversational recommenders

Assessing mental state inference and behavioral prediction in dialogues

Benchmarking LLMs on strategic reasoning for recommendation interactions

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces RecToM, a benchmark for evaluating Theory of Mind in recommendation dialogues

Focuses on cognitive inference and behavioral prediction dimensions

Assesses LLMs' ability to use inferred mental states to predict, select, and assess dialogue strategies
