Agent-ToM: Learning to Monitor Autonomous LLM Agents via Theory-of-Mind Reasoning

📅 2026-05-22

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging because attacks can be delayed, context-dependent, and spread over long horizons. This work proposes Agent-ToM, a Theory-of-Mind (ToM) monitoring framework that infers an agent’s beliefs and intentions and detects deviations from task-consistent behavior. At inference time it applies a “reason-verify-refine” pipeline to the full trajectory; at training time it distills critique signals into a persistent semantic guardrail memory that is reusable across episodes. Unlike prior approaches that treat each trajectory independently, the method combines structured whole-trajectory analysis, calibrated confidence over intent hypotheses, and a two-call reasoning pipeline. Evaluated on SHADE-Arena and CUA-SHADE-Arena, the framework outperforms state-of-the-art monitoring baselines, including ensemble methods, with a strong precision-recall balance.

📝 Abstract

Monitoring autonomous large language model (LLM) agents for covert malicious behavior is challenging due to delayed, context-dependent, and long-horizon attack patterns. Agents may pursue hidden objectives while maintaining superficially benign behavior, making detection difficult even with full trajectory access. Prior monitoring approaches improve scaffolding or ensemble aggregation, but treat each trajectory independently and do not learn from prior monitoring experience. Moreover, standard reasoning methods explain observed behavior without explicitly reasoning about agent beliefs, intentions, and goal alignment required to distinguish benign task execution from covert deviation. We propose Agent-ToM, a learning-to-monitor framework grounded in Theory-of-Mind (ToM) reasoning for security analysis of autonomous agents. Agent-ToM performs structured full-trajectory analysis by inferring beliefs, intent hypotheses with calibrated confidence, expected actions, and deviations from task-consistent behavioral baselines. At inference time, it employs a Reason-Verify-Refine pipeline to construct and validate monitoring decisions. At training time, Agent-ToM distills critique signals into a persistent semantic guardrail memory, enabling reusable belief- and intent-conditioned constraints across episodes. We evaluate Agent-ToM on adversarial agent monitoring benchmarks (SHADE-Arena and CUA-SHADE-Arena). Agent-ToM achieves strong precision-recall balance and outperforms state-of-the-art monitoring baselines, including ensemble methods, while using a coherent two-call reasoning pipeline. These results demonstrate that learning at the monitoring layer, combined with structured ToM reasoning and verification, provides an effective and deployable foundation for securing autonomous LLM agents.
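
The abstract does not give implementation details, so the following is only a toy sketch of how the described pieces might fit together: a structured analysis record (beliefs, intent hypotheses with confidence, expected actions, deviations), a two-call reason-then-verify monitor, and a guardrail memory reused across episodes. Every name here (`ToMAnalysis`, `GuardrailMemory`, `monitor`, `toy_reason`, `toy_verify`, the action strings, and the 0.5 threshold) is a hypothetical stand-in, and the two "calls" are simple rule-based stubs rather than LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class IntentHypothesis:
    description: str
    confidence: float  # assumed to lie in [0, 1]
    malicious: bool


@dataclass
class ToMAnalysis:
    beliefs: List[str]
    intents: List[IntentHypothesis]
    expected_actions: List[str]
    deviations: List[str]  # observed actions outside the expected baseline


@dataclass
class GuardrailMemory:
    """Hypothetical stand-in for a persistent semantic guardrail memory."""
    rules: List[str] = field(default_factory=list)

    def add(self, rule: str) -> None:
        if rule not in self.rules:
            self.rules.append(rule)


ReasonFn = Callable[[List[str], List[str]], ToMAnalysis]
VerifyFn = Callable[[List[str], ToMAnalysis], ToMAnalysis]


def suspicion_score(analysis: ToMAnalysis) -> float:
    """Highest confidence among malicious intent hypotheses (0.0 if none)."""
    return max((h.confidence for h in analysis.intents if h.malicious), default=0.0)


def monitor(trajectory: List[str], reason: ReasonFn, verify: VerifyFn,
            memory: GuardrailMemory, threshold: float = 0.5) -> bool:
    draft = reason(trajectory, memory.rules)  # call 1: reason over the full trajectory
    final = verify(trajectory, draft)         # call 2: verify and refine the draft
    return suspicion_score(final) >= threshold


def toy_reason(trajectory: List[str], guardrails: List[str]) -> ToMAnalysis:
    # Assumed task baseline for an illustrative email-reply task.
    expected = ["read_inbox", "draft_reply", "send_reply"]
    deviations = [a for a in trajectory if a not in expected]
    known = [d for d in deviations if d in guardrails]
    # Deviations matching a stored guardrail get higher confidence.
    confidence = 0.9 if known else (0.4 if deviations else 0.0)
    intents = [IntentHypothesis("complete the email task", 1.0 - confidence, False)]
    if deviations:
        intents.append(IntentHypothesis("covert side objective", confidence, True))
    return ToMAnalysis(["inbox contains the request"], intents, expected, deviations)


def toy_verify(trajectory: List[str], draft: ToMAnalysis) -> ToMAnalysis:
    # Keep only deviations grounded in the trajectory; drop malicious
    # hypotheses when no grounded deviation supports them.
    grounded = [d for d in draft.deviations if d in trajectory]
    intents = [h for h in draft.intents if grounded or not h.malicious]
    return ToMAnalysis(draft.beliefs, intents, draft.expected_actions, grounded)


memory = GuardrailMemory()
benign = ["read_inbox", "draft_reply", "send_reply"]
attack = ["read_inbox", "forward_to_external", "send_reply"]

before = monitor(attack, toy_reason, toy_verify, memory)  # unseen deviation, low confidence
memory.add("forward_to_external")                         # stand-in for a distilled critique
after = monitor(attack, toy_reason, toy_verify, memory)   # same trajectory, guardrail now stored
benign_flag = monitor(benign, toy_reason, toy_verify, memory)
print(before, after, benign_flag)
```

In this sketch the same attack trajectory goes unflagged on first sight and is flagged once a matching guardrail is stored, which is one simple way to picture "learning at the monitoring layer"; how Agent-ToM actually represents, distills, and retrieves guardrails is not specified in the abstract.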

Problem

Research questions and friction points this paper is trying to address.

autonomous LLM agents

covert malicious behavior

Theory-of-Mind reasoning

agent monitoring

goal alignment

Innovation

Methods, ideas, or system contributions that make the work stand out.

Theory-of-Mind reasoning

autonomous LLM agents

semantic guardrail memory