CoSToM: Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment in Large Language Models

2026-04-11

AI Summary
This work addresses the limited intrinsic theory of mind (ToM) cognition in large language models, which hinders their ability to consistently generate high-quality social reasoning behaviors. The study pioneers a shift from causal intervention as a post-hoc explanatory tool to an active regulatory mechanism. By employing causal tracing to identify the specific network layers encoding ToM-related semantics, the authors implement targeted activation interventions at these layers, establishing a lightweight and directional ToM alignment framework. This approach effectively aligns the model's internal ToM representations with its overt behavioral outputs, substantially enhancing generalization, stability, and downstream dialogue quality on complex ToM tasks, thereby advancing the model toward genuine, internally grounded social cognitive capabilities.

Abstract
Theory of Mind (ToM), the ability to attribute mental states to others, is a hallmark of social intelligence. While large language models (LLMs) demonstrate promising performance on standard ToM benchmarks, we observe that they often fail to generalize to complex task-specific scenarios, relying heavily on prompt scaffolding to mimic reasoning. The critical misalignment between the internal knowledge and external behavior raises a fundamental question: Do LLMs truly possess intrinsic cognition, and can they externalize this internal knowledge into stable, high-quality behaviors? To answer this, we introduce CoSToM (Causal-oriented Steering for ToM alignment), a framework that transitions from mechanistic interpretation to active intervention. First, we employ causal tracing to map the internal distribution of ToM features, empirically uncovering the internal layers' characteristics in encoding fundamental ToM semantics. Building on this insight, we implement a lightweight alignment framework via targeted activation steering within these ToM-critical layers. Experiments demonstrate that CoSToM significantly enhances human-like social reasoning capabilities and downstream dialogue quality.
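
The two stages described in the abstract (tracing which layers matter, then steering activations at those layers) can be illustrated with a toy sketch. This is not the authors' implementation: the four-layer residual "model", its `LAYER_WEIGHTS`, the `forward`, `trace_layers`, and `steering_vector` helpers, and the clean/corrupt inputs are all hypothetical, chosen only so that the logic is easy to follow. Real causal tracing and steering operate on transformer hidden states, typically through framework hooks.

```python
# Toy residual "model": state = [source, readout]. Each layer reads the
# source feature and writes w * source into the readout slot.
# All names and numbers here are illustrative assumptions.
LAYER_WEIGHTS = [0.1, 0.2, 1.5, 0.3]  # hypothetical per-layer write strengths


def forward(source, patch=None, steer=None):
    """Run the toy model and return (output, per-layer states).

    patch: optional (layer, update) pair that overrides that layer's write.
    steer: optional (layer, vector) pair added to the state after that layer.
    """
    state = [source, 0.0]
    states = []
    for i, w in enumerate(LAYER_WEIGHTS):
        update = w * state[0]
        if patch is not None and patch[0] == i:
            update = patch[1]  # restore a stored update from another run
        state = [state[0], state[1] + update]
        if steer is not None and steer[0] == i:
            state = [s + v for s, v in zip(state, steer[1])]
        states.append(list(state))
    return state[1], states


def trace_layers(clean_source, corrupt_source):
    """Causal tracing: restore each layer's clean update in a corrupted run
    and report the fraction of the clean output that is recovered."""
    clean_out, _ = forward(clean_source)
    corrupt_out, _ = forward(corrupt_source)
    effects = []
    for i, w in enumerate(LAYER_WEIGHTS):
        # The source slot is never overwritten in this toy, so the clean
        # update of layer i is simply w * clean_source.
        clean_update = w * clean_source
        patched_out, _ = forward(corrupt_source, patch=(i, clean_update))
        effects.append((patched_out - corrupt_out) / (clean_out - corrupt_out))
    return effects


def steering_vector(layer, clean_source, corrupt_source):
    """Difference between clean and corrupted states after the given layer."""
    _, clean_states = forward(clean_source)
    _, corrupt_states = forward(corrupt_source)
    return [c - k for c, k in zip(clean_states[layer], corrupt_states[layer])]


# Stage 1: find the layer whose restoration recovers the most output.
effects = trace_layers(1.0, 0.0)
critical_layer = max(range(len(effects)), key=effects.__getitem__)

# Stage 2: steer the corrupted run by adding the vector at that layer.
alpha = 1.0  # hypothetical steering strength
vector = steering_vector(critical_layer, 1.0, 0.0)
steered_out, _ = forward(0.0, steer=(critical_layer, [alpha * v for v in vector]))
```

In this toy, tracing selects the layer with the largest write strength, and adding the clean-minus-corrupt state difference at that layer brings the corrupted run's output back to the clean output. Whether steering at a single traced layer is sufficient in a real LLM is an empirical question that the sketch does not address.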

Problem

Research questions and friction points this paper is trying to address.

Theory of Mind
Large Language Models
Intrinsic Cognition
Behavioral Alignment
Generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Causal Tracing
Activation Steering
Theory of Mind
Intrinsic Alignment
Large Language Models