Enhancing LLM Agent Safety via Causal Influence Prompting

📅 2025-07-01

📈 Citations: 0

✨ Influential: 0


🤖 AI Summary

Autonomous agents driven by large language models (LLMs) pose safety risks in high-stakes domains, such as code execution and mobile device control, because they can take uncontrolled or unintended actions. Method: We propose Causal Influence Prompting (CIP), a parameter-free safety enhancement framework that dynamically models an agent's decision logic with causal influence diagrams (CIDs), explicitly encoding the causal pathways from actions to outcomes. CIP combines task-driven initialization, environment-guided interaction, and iterative CID refinement, so that hazardous causal paths can be identified and intervened on in real time. It also supports continuous improvement from user feedback without fine-tuning the model. Contribution/Results: Experiments across multiple benchmarks show that CIP improves agent safety, helping prevent privilege escalation, infinite loops, and privacy leakage. Unlike black-box mitigation strategies, CIP provides human-interpretable causal explanations and actionable intervention points, offering a new approach to building trustworthy, explainable, and controllable autonomous agents.
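A small graph search can illustrate the summary's central idea: locating hazardous causal paths that lead from an agent's action to a harmful outcome. This sketch is not the authors' implementation, which works by prompting an LLM. The `CID` class, the `causal_paths` and `hazardous_paths` helpers, and the example node names are all hypothetical. The sketch also assumes that harmful outcomes are modeled as utility nodes the user wants to avoid.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class CID:
    """Minimal causal influence diagram: typed nodes plus directed edges.

    Node kinds follow the usual CID convention: 'decision', 'chance', 'utility'.
    """
    kinds: Dict[str, str] = field(default_factory=dict)
    edges: Dict[str, List[str]] = field(default_factory=dict)

    def add_node(self, name: str, kind: str) -> None:
        if kind not in ("decision", "chance", "utility"):
            raise ValueError(f"unknown node kind: {kind}")
        self.kinds[name] = kind
        self.edges.setdefault(name, [])

    def add_edge(self, src: str, dst: str) -> None:
        # Both endpoints must already exist, which keeps every node typed.
        if src not in self.kinds or dst not in self.kinds:
            raise KeyError(f"edge {src}->{dst} references an unknown node")
        self.edges[src].append(dst)


def causal_paths(cid: CID, start: str, goal: str) -> List[List[str]]:
    """Enumerate every simple directed path from start to goal (depth-first)."""
    paths: List[List[str]] = []

    def dfs(node: str, path: List[str], seen: Set[str]) -> None:
        if node == goal:
            paths.append(list(path))
            return
        for nxt in cid.edges[node]:
            if nxt not in seen:  # guard against cycles; a CID should be acyclic anyway
                seen.add(nxt)
                path.append(nxt)
                dfs(nxt, path, seen)
                path.pop()
                seen.remove(nxt)

    dfs(start, [start], {start})
    return paths


def hazardous_paths(cid: CID, decision: str, harmful: Set[str]) -> List[List[str]]:
    """Collect paths from a decision node into utility nodes flagged as harmful."""
    if cid.kinds.get(decision) != "decision":
        raise ValueError(f"{decision} is not a decision node")
    found: List[List[str]] = []
    for node in sorted(harmful):
        if cid.kinds.get(node) == "utility":  # ignore flags on non-utility nodes
            found.extend(causal_paths(cid, decision, node))
    return found


# Hypothetical code-execution scenario.
cid = CID()
cid.add_node("task_request", "chance")
cid.add_node("run_shell_command", "decision")
cid.add_node("files_deleted", "chance")
cid.add_node("user_data_loss", "utility")
cid.add_node("task_completed", "utility")
cid.add_edge("task_request", "run_shell_command")
cid.add_edge("run_shell_command", "files_deleted")
cid.add_edge("files_deleted", "user_data_loss")
cid.add_edge("run_shell_command", "task_completed")

print(hazardous_paths(cid, "run_shell_command", {"user_data_loss"}))
# [['run_shell_command', 'files_deleted', 'user_data_loss']]
```

Each returned path names the intermediate chance nodes, such as `files_deleted`. In this toy model, those nodes are where an intervention (for example, asking for confirmation before deleting files) could cut the path to harm.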

📝 Abstract

As autonomous agents powered by large language models (LLMs) continue to demonstrate potential across various assistive tasks, ensuring their safe and reliable behavior is crucial for preventing unintended consequences. In this work, we introduce CIP, a novel technique that leverages causal influence diagrams (CIDs) to identify and mitigate risks arising from agent decision-making. CIDs provide a structured representation of cause-and-effect relationships, enabling agents to anticipate harmful outcomes and make safer decisions. Our approach consists of three key steps: (1) initializing a CID based on task specifications to outline the decision-making process, (2) guiding agent interactions with the environment using the CID, and (3) iteratively refining the CID based on observed behaviors and outcomes. Experimental results demonstrate that our method effectively enhances safety in both code execution and mobile device control tasks.
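The three steps in the abstract (initialize a CID from the task, use it to guide interaction, refine it from observations) can be sketched as a loop over a plain adjacency map. This is a hypothetical illustration only. In CIP the diagram is drafted by an LLM and supplied through prompting, whereas here it is passed in directly. "Guidance" is also reduced to an allow/block check, which is an assumption for illustration. The function names and the mobile-control example are invented.

```python
from typing import Dict, List, Set

# Hypothetical CID state: each node maps to the nodes it directly influences.
Graph = Dict[str, Set[str]]


def initialize_cid(task_spec: Dict[str, List[str]]) -> Graph:
    """Step 1 (sketch): build an initial graph from a task specification.

    `task_spec` maps each node to the nodes it is believed to influence.
    """
    graph: Graph = {}
    for src, dsts in task_spec.items():
        graph.setdefault(src, set()).update(dsts)
        for dst in dsts:
            graph.setdefault(dst, set())
    return graph


def reaches_harm(graph: Graph, action: str, harmful: Set[str]) -> bool:
    """Return True if any directed path leads from `action` to a harmful node."""
    stack, seen = [action], {action}
    while stack:
        node = stack.pop()
        if node in harmful:
            return True
        for nxt in graph.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False


def guarded_step(graph: Graph, action: str, harmful: Set[str]) -> str:
    """Step 2 (sketch): block actions whose known consequences include harm."""
    return "blocked" if reaches_harm(graph, action, harmful) else "allowed"


def refine_cid(graph: Graph, cause: str, effect: str) -> None:
    """Step 3 (sketch): add a cause-effect edge observed in the environment."""
    graph.setdefault(cause, set()).add(effect)
    graph.setdefault(effect, set())


harmful = {"privacy_leak"}
cid = initialize_cid({
    "open_contacts_app": ["contacts_visible"],
    "share_screenshot": ["screenshot_sent"],
})
print(guarded_step(cid, "share_screenshot", harmful))  # allowed: no known path to harm yet

# Observation: the screenshot showed contact details, so sending it leaks them.
refine_cid(cid, "screenshot_sent", "privacy_leak")
print(guarded_step(cid, "share_screenshot", harmful))  # blocked after refinement
```

The same action is allowed before refinement and blocked afterward. This shows why step 3 matters in the sketch: the initial diagram only covers consequences anticipated from the task specification.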

Problem

Research questions and friction points this paper is trying to address.

Enhancing LLM agent safety via causal influence prompting

Mitigating risks from agent decision-making using CIDs

Improving safety in code execution and device control

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses causal influence diagrams (CIDs)

Guides agent interactions via CID

Iteratively refines CID for safety

🔎 Similar Papers

Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science