SafeClaw-R: Towards Safe and Secure Multi-Agent Personal Assistants

📅 2026-03-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses critical security risks, such as reasoning errors and prompt injection attacks, faced by large language model (LLM)-driven multi-agent systems when executing cross-platform tasks. The authors propose SafeClaw-R, a framework that enforces safety as a system-level invariant over the execution graph, overcoming the limitations of static guardrails and post-hoc detection. By mediating every action at runtime, intercepting high-risk actions before they execute, and dynamically substituting them with safety-augmented skills, the framework prevents unsafe operations rather than detecting them after the fact. Key contributions include a secure execution-graph mediation mechanism, safety-augmented skill modules, and adversarial behavior detection. Evaluated in a Google Workspace environment, the system achieves 95.2% task accuracy, detects 97.8% of malicious third-party skills, and attains 100% detection accuracy on an adversarial code-execution benchmark.
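The summary's core loop (intercept each action before execution, let low-risk actions through, substitute a safety-augmented counterpart for high-risk ones where a counterpart exists, and block the rest) can be sketched as below. The paper's actual mediation rules are not given here, so the skill names, risk policy, and substitution table are all illustrative assumptions.

```python
# Hypothetical sketch of pre-execution action mediation, in the spirit of
# SafeClaw-R's runtime interception and skill substitution. The skill names,
# HIGH_RISK set, and SAFE_SUBSTITUTE table are assumptions for illustration,
# not the paper's implementation.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Action:
    skill: str   # e.g. "gmail.send", "drive.delete"
    args: dict


# Assumed risk policy: which skills count as high-risk, and which
# safety-augmented counterpart (if any) can stand in for them.
HIGH_RISK = {"drive.delete", "shell.exec"}
SAFE_SUBSTITUTE = {"drive.delete": "drive.trash"}  # reversible variant


def mediate(action: Action) -> Optional[Action]:
    """Mediate an action before it runs: pass it through unchanged,
    swap in a safety-augmented counterpart, or block it (return None)."""
    if action.skill not in HIGH_RISK:
        return action                           # low risk: execute as-is
    substitute = SAFE_SUBSTITUTE.get(action.skill)
    if substitute is not None:
        return Action(substitute, action.args)  # safety-augmented counterpart
    return None                                 # no safe variant: block
```

The key property this models is that mediation is an invariant of the execution path, not an optional check: every action passes through `mediate` before it can run.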
📝 Abstract
LLM-based multi-agent systems (MASs) are transforming personal productivity by autonomously executing complex, cross-platform tasks. Frameworks such as OpenClaw demonstrate the potential of locally deployed agents integrated with personal data and services, but this autonomy introduces significant safety and security risks. Unintended actions from LLM reasoning failures can cause irreversible harm, while prompt injection attacks may exfiltrate credentials or compromise the system. Our analysis shows that 36.4% of OpenClaw's built-in skills pose high or critical risks. Existing approaches, including static guardrails and LLM-as-a-Judge, lack reliable real-time enforcement and consistent authority in MAS settings. To address this, we propose SafeClaw-R, a framework that enforces safety as a system-level invariant over the execution graph by ensuring that actions are mediated prior to execution, and systematically augments skills with safe counterparts. We evaluate SafeClaw-R across three representative domains: productivity platforms, third-party skill ecosystems, and code execution environments. SafeClaw-R achieves 95.2% accuracy in Google Workspace scenarios, significantly outperforming regex baselines (61.6%), detects 97.8% of malicious third-party skill patterns, and achieves 100% detection accuracy in our adversarial code execution benchmark. These results demonstrate that SafeClaw-R enables practical runtime enforcement for autonomous MASs.
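For context on the 61.6% regex baseline the abstract compares against, a static pattern scan over third-party skill code might look like the following sketch. The patterns are assumptions chosen to illustrate the approach (and its brittleness), not the benchmark's actual rules.

```python
# Illustrative regex-style baseline for flagging suspicious constructs in
# third-party skill source code -- the kind of static check the abstract
# reports SafeClaw-R outperforming. All patterns are assumptions.
import re

SUSPICIOUS = [
    re.compile(r"\bos\.environ\b"),                  # credential/env access
    re.compile(r"\b(eval|exec)\s*\("),               # dynamic code execution
    re.compile(r"requests\.(get|post)\s*\(.*http"),  # possible exfiltration
]


def flag_skill(source: str) -> list:
    """Return the suspicious patterns matched in a skill's source code."""
    return [p.pattern for p in SUSPICIOUS if p.search(source)]
```

Such checks are easy to evade with trivial obfuscation (e.g. `getattr(os, "environ")`), which is the gap runtime mediation is meant to close: it judges the action actually being performed, not the text of the code.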
Problem

Research questions and friction points this paper is trying to address.

multi-agent systems
safety
security
prompt injection
LLM reasoning failures
Innovation

Methods, ideas, or system contributions that make the work stand out.

multi-agent systems
safety enforcement
runtime mediation
LLM security
skill augmentation