Overcoming Multi-step Complexity in Multimodal Theory-of-Mind Reasoning: A Scalable Bayesian Planner

📅 2025-06-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing computational Theory-of-Mind (ToM) approaches suffer from poor generalization and limited scalability in multimodal settings. This paper proposes a stepwise Bayesian updating framework that decomposes complex, multi-step ToM reasoning into iterative, modular inference steps. It introduces a "weak-to-strong" collaborative control mechanism: a lightweight language model specializes in ToM-specific likelihood estimation and transfers its reasoning behavior to large language models (LLMs) ranging from 7B to 405B parameters, thereby aligning LLM-based social cognition with Bayesian inference principles. The method combines stepwise Bayesian probabilistic updating, multimodal representation learning, and small-model-guided collaborative reasoning with large models. On multimodal ToM benchmarks it achieves a 4.6% absolute accuracy improvement over state-of-the-art techniques and markedly better generalization to unseen, complex scenarios, establishing a scalable, interpretable paradigm for modeling higher-order human mental states.
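
In Bayesian terms, the stepwise update can be summarized as follows; the notation (m for a candidate mental state, o_t for the observation at step t) is our shorthand rather than the paper's:

\[
P(m \mid o_{1:t}) \;\propto\; \underbrace{P(o_t \mid m,\, o_{1:t-1})}_{\text{ToM likelihood (small LM)}} \;\cdot\; \underbrace{P(m \mid o_{1:t-1})}_{\text{prior belief (large LM, social/world knowledge)}}
\]

At each step the small, ToM-specialized model scores how well each candidate mental state explains the new observation, and the resulting posterior becomes the prior for the next step.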

📝 Abstract
Theory-of-Mind (ToM) enables humans to infer mental states, such as beliefs, desires, and intentions, forming the foundation of social cognition. However, existing computational ToM methods rely on structured workflows with ToM-specific priors or deep model fine-tuning, which struggle with scalability in multimodal environments and fail to generalize as task complexity increases. To address these limitations, we propose a scalable Bayesian ToM planner that decomposes ToM reasoning into stepwise Bayesian updates. Our framework introduces weak-to-strong control, allowing smaller language models (LMs) to specialize in ToM-specific likelihood estimation and transfer their reasoning behaviors to larger LMs (7B to 405B) for integration with social and world knowledge. This synergistic approach aligns large-model inference of human mental states with Bayesian principles. Extensive experiments show that our method achieves a 4.6% accuracy improvement over state-of-the-art techniques on multimodal ToM benchmarks, including challenging unseen scenarios, thereby establishing a new standard for modeling human mental states in complex environments.
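
A minimal sketch of how such stepwise, weak-to-strong Bayesian updating could be organized is shown below. The function names (small_lm_likelihood, large_lm_prior), the discrete hypothesis space, and the normalization scheme are illustrative assumptions, not the paper's actual interface:

# Hypothetical sketch: stepwise Bayesian ToM updating with weak-to-strong control.
# small_lm_likelihood and large_lm_prior are placeholder callables, not the paper's API.
def bayesian_tom_planner(hypotheses, observations, small_lm_likelihood, large_lm_prior):
    """Iteratively update a belief over candidate mental states.

    hypotheses: candidate mental states (e.g., beliefs, desires, intentions).
    observations: ordered multimodal observations of the observed agent.
    small_lm_likelihood(h, obs, history): ToM-specialized small-LM score for P(obs | h, history).
    large_lm_prior(h): large-LM prior over h drawn from social and world knowledge.
    """
    # Start from the large model's prior over mental states, normalized to sum to 1.
    belief = {h: large_lm_prior(h) for h in hypotheses}
    total = sum(belief.values())
    belief = {h: p / total for h, p in belief.items()}

    history = []
    for obs in observations:
        # One Bayesian step: small-LM likelihood of the new observation times the current belief.
        unnormalized = {h: small_lm_likelihood(h, obs, history) * belief[h] for h in hypotheses}
        z = sum(unnormalized.values())
        belief = {h: p / z for h, p in unnormalized.items()}
        history.append(obs)

    return belief  # posterior over candidate mental states after all observation steps

The most probable mental state is then simply the argmax over the returned posterior; each intermediate posterior is also available, which is what makes the reasoning stepwise and interpretable.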
Problem

Research questions and friction points this paper is trying to address.

Poor scalability of existing computational ToM methods in multimodal environments
Failure of prior-heavy or fine-tuned ToM workflows to generalize as task complexity increases
Need for accurate mental-state inference in complex, unseen scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable Bayesian planner for ToM reasoning
Weak-to-strong control for model specialization
Stepwise Bayesian updates for mental state inference
🔎 Similar Papers
No similar papers found.