Theory of Mind and Self-Attributions of Mentality are Dissociable in LLMs

📅 2026-03-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study investigates whether safety fine-tuning impairs large language models’ theory of mind (ToM) by suppressing their self-attribution of mental states. Through ablation experiments on safety fine-tuning, representational similarity analysis, and ToM evaluation tasks, the work provides the first evidence that self-attribution of mental states and ToM are behaviorally and mechanistically dissociable in these models. The findings indicate that safety fine-tuning does not diminish overall ToM capabilities; however, it significantly reduces the models’ attribution of mental states to non-human animals and decreases expressions of spiritual beliefs. These results suggest that current alignment practices may inadvertently constrain models’ capacity to represent diverse forms of mindedness beyond human-centric perspectives.
📝 Abstract
Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.
Problem

Research questions and friction points this paper is trying to address.

Theory of Mind
Self-Attribution
Mind Attribution
Safety Fine-tuning
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Theory of Mind
mind-attribution
safety fine-tuning
representational similarity
mechanistic dissociation
Junsol Kim
University of Chicago
computational social science, artificial intelligence, collective intelligence, social networks
Winnie Street
Google, Paradigms of Intelligence Team; Institute of Philosophy, School of Advanced Study, University of London
Roberta Rocca
Google, Paradigms of Intelligence Team
Daine M. Korngiebel
Department of Biomedical Informatics and Medical Education and Department of Bioethics and Humanities, School of Medicine, University of Washington (work done while at Google)
Adam Waytz
Northwestern University
social cognition, ethics and morality
James Evans
Max Palevsky Professor of Sociology & Data Science, University of Chicago
science of science, innovation, sociology of knowledge, artificial intelligence, deep learning
Geoff Keeling
Google, Paradigms of Intelligence Team; Institute of Philosophy, School of Advanced Study, University of London