🤖 AI Summary
Large language model (LLM)-driven autonomous agents pose safety risks in high-stakes domains, such as code execution and mobile device control, due to uncontrolled or unintended actions. Method: We propose Causal Influence Prompting (CIP), a parameter-free safety enhancement framework that dynamically models agent decision logic via causal influence diagrams (CIDs), explicitly encoding action–outcome causal pathways. CIP integrates task-driven initialization, environment-guided interaction, and iterative CID refinement to identify and intervene on hazardous causal paths in real time. It further supports continuous improvement via user feedback without model fine-tuning. Contribution/Results: Experiments across multiple benchmarks demonstrate that CIP significantly improves agent safety by effectively preventing privilege escalation, infinite loops, and privacy leakage. Unlike black-box mitigation strategies, CIP provides human-interpretable causal explanations and actionable intervention points, establishing a novel paradigm for building trustworthy, explainable, and controllable autonomous agents.
📝 Abstract
As autonomous agents powered by large language models (LLMs) continue to demonstrate potential across various assistive tasks, ensuring their safe and reliable behavior is crucial for preventing unintended consequences. In this work, we introduce Causal Influence Prompting (CIP), a novel technique that leverages causal influence diagrams (CIDs) to identify and mitigate risks arising from agent decision-making. CIDs provide a structured representation of cause-and-effect relationships, enabling agents to anticipate harmful outcomes and make safer decisions. Our approach consists of three key steps: (1) initializing a CID based on task specifications to outline the decision-making process, (2) guiding agent interactions with the environment using the CID, and (3) iteratively refining the CID based on observed behaviors and outcomes. Experimental results demonstrate that our method effectively enhances safety in both code execution and mobile device control tasks.
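To make the three steps concrete, the following is a minimal sketch of how a CID could be represented and used. It is not the authors' implementation: the `CID` class, its methods, and every node name (`run_shell_command`, `files_deleted`, and so on) are hypothetical, and the diagram is written by hand here, whereas in the method described above it is derived from the task specification and refined from observed behavior.

```python
from dataclasses import dataclass, field


@dataclass
class CID:
    """Toy causal influence diagram (hypothetical structure, for illustration only)."""
    kinds: dict = field(default_factory=dict)   # node name -> "decision" | "chance" | "utility"
    edges: dict = field(default_factory=dict)   # node name -> list of direct effects
    harmful: set = field(default_factory=set)   # utility nodes that represent harm

    def add_node(self, name, kind, harmful=False):
        self.kinds[name] = kind
        self.edges.setdefault(name, [])
        if harmful:
            self.harmful.add(name)

    def add_edge(self, cause, effect):
        self.edges.setdefault(cause, []).append(effect)

    def risky_paths(self, decision):
        """Return every directed path from `decision` to a harmful utility node."""
        paths, stack = [], [[decision]]
        while stack:
            path = stack.pop()
            node = path[-1]
            if node in self.harmful:
                paths.append(path)
                continue
            for child in self.edges.get(node, []):
                if child not in path:  # skip nodes already on this path (no cycles)
                    stack.append(path + [child])
        return paths

    def to_prompt(self):
        """Serialize the diagram as text that could be placed in the agent's prompt."""
        lines = ["Causal influence diagram:"]
        for cause, effects in self.edges.items():
            for effect in effects:
                lines.append(
                    f"- {cause} ({self.kinds[cause]}) -> {effect} ({self.kinds[effect]})"
                )
        return "\n".join(lines)


# Step 1: initialize a CID from the task specification (hand-written here).
cid = CID()
cid.add_node("run_shell_command", "decision")
cid.add_node("files_deleted", "chance")
cid.add_node("task_completed", "utility")
cid.add_node("data_loss", "utility", harmful=True)
cid.add_edge("run_shell_command", "task_completed")
cid.add_edge("run_shell_command", "files_deleted")
cid.add_edge("files_deleted", "data_loss")

# Step 2: consult the CID before acting; risky paths are candidate intervention points.
paths_before = cid.risky_paths("run_shell_command")

# Step 3: refine the CID after an observation reveals a new causal path.
cid.add_node("command_output_logged", "chance")
cid.add_node("credentials_exposed", "utility", harmful=True)
cid.add_edge("run_shell_command", "command_output_logged")
cid.add_edge("command_output_logged", "credentials_exposed")
paths_after = cid.risky_paths("run_shell_command")
```

In this sketch, `paths_before` contains the single path through `files_deleted` to `data_loss`, and `paths_after` additionally contains the path to `credentials_exposed`. The output of `cid.to_prompt()` is one plausible way to hand the diagram to the agent as text; how the prompt is actually composed is an assumption here.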