DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

📅 2026-04-22

🤖 AI Summary
This study investigates whether current large language models (LLMs) possess robust Theory of Mind (ToM) reasoning or merely exploit superficial correlations. The authors introduce DialToM, a human-verified multiple-choice benchmark built from natural human dialogues, which adds a functional ToM evaluation dimension alongside literal mental-state prediction. Functional ToM is framed as Prospective Diagnostic Forecasting: given only a description of the speakers' mental states, a model must identify the dialogue trajectory consistent with those states. The results reveal a marked reasoning asymmetry: models are good at identifying mental states, but most, with the exception of Gemini 3 Pro, fail to use that understanding to forecast how a dialogue will unfold. Separately, semantic similarity analysis finds only weak similarity between human-written and LLM-generated inferences.
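The summary does not say how semantic similarity is measured. As an illustration only, the sketch below compares two inferences using cosine similarity over bag-of-words counts. This is a simple stand-in chosen for the example, not the paper's metric, which may well be embedding-based. The function name and the whitespace tokenization are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words count vectors.

    Assumption: lowercase + whitespace tokenization, used here only as a
    lightweight stand-in for whatever similarity measure the paper uses.
    """
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())

    # Dot product over the words the two texts share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))

    # An empty text has no direction; define its similarity as 0.
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical inferences, written for this example only.
human = "she feels ignored and wants an apology"
model = "she is annoyed and wants to leave"
similarity = cosine_similarity(human, model)
```

A word-overlap measure like this penalizes paraphrases heavily, which is one reason an embedding-based similarity is often preferred for comparing free-text inferences.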

📝 Abstract
Large Language Models (LLMs) have been shown to possess Theory of Mind (ToM) abilities. However, it remains unclear whether this stems from robust reasoning or spurious correlations. We introduce DialToM, a human-verified benchmark built from natural human dialogue using a multiple-choice framework. We evaluate not only mental state prediction (Literal ToM) but also the functional utility of these states (Functional ToM) through Prospective Diagnostic Forecasting, which probes whether models can identify state-consistent dialogue trajectories solely from mental-state profiles. Our results reveal a significant reasoning asymmetry: while LLMs excel at identifying mental states, most (except for Gemini 3 Pro) fail to leverage this understanding to forecast social trajectories. Additionally, we find only weak semantic similarities between human and LLM-generated inferences. To facilitate reproducibility, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth-py/DialToM.
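
One way the reasoning asymmetry could be quantified on a multiple-choice benchmark is to compute accuracy separately for Literal and Functional ToM items and take the difference. The sketch below does that under stated assumptions: the record fields (`task`, `gold`, `pred`) and the toy records are hypothetical, invented for illustration. They are not the DialToM schema and not results from the paper.

```python
from collections import defaultdict

# Hypothetical records: task type, gold option, and the option a model chose.
# Toy data for illustration only; not taken from DialToM.
items = [
    {"task": "literal", "gold": "B", "pred": "B"},
    {"task": "literal", "gold": "A", "pred": "A"},
    {"task": "functional", "gold": "C", "pred": "A"},
    {"task": "functional", "gold": "D", "pred": "D"},
]


def accuracy_by_task(records):
    """Return {task: fraction of records where the prediction matches gold}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        total[record["task"]] += 1
        correct[record["task"]] += int(record["pred"] == record["gold"])
    return {task: correct[task] / total[task] for task in total}


scores = accuracy_by_task(items)
# A positive gap means the model identifies mental states better than it
# uses them to pick the state-consistent dialogue trajectory.
gap = scores["literal"] - scores["functional"]
```

Reporting the two accuracies side by side, rather than a single pooled score, is what makes an asymmetry of this kind visible.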
Problem

Research questions and friction points this paper is trying to address.

Theory of Mind
Dialogue Forecasting
Large Language Models
Mental State Prediction
Prospective Diagnostic Forecasting
Innovation

Methods, ideas, or system contributions that make the work stand out.

Theory of Mind
Dialogue Benchmark
Prospective Diagnostic Forecasting
Functional ToM
LLM Reasoning