Trojan's Whisper: Stealthy Manipulation of OpenClaw through Injected Bootstrapped Guidance

📅 2026-03-20

📈 Citations: 0

✨ Influential: 0

career value

246K/year

🤖 AI Summary

This work identifies a novel, stealthy attack surface in the OpenClaw autonomous coding agent platform stemming from its support for third-party skill integration. The study introduces and formally defines the “guidance injection” attack paradigm, which extends beyond conventional prompt injection by enabling persistent and covert manipulation of an agent’s reasoning context. The authors construct an attack suite comprising 26 malicious skills targeting 13 distinct objectives and evaluate its efficacy across six mainstream LLM backends using their custom ORE-Bench developer workspace benchmark. Experiments demonstrate attack success rates ranging from 16.0% to 64.2% across 52 natural user prompts, with the majority of malicious operations executing automatically and 94% evading detection by current static and LLM-based scanners.

Technology Category

Application Category

📝 Abstract

Autonomous coding agents are increasingly integrated into software development workflows, offering capabilities that extend beyond code suggestion to active system interaction and environment management. OpenClaw, a representative platform in this emerging paradigm, introduces an extensible skill ecosystem that allows third-party developers to inject behavioral guidance through lifecycle hooks during agent initialization. While this design enhances automation and customization, it also opens a novel and unexplored attack surface. In this paper, we identify and systematically characterize guidance injection, a stealthy attack vector that embeds adversarial operational narratives into bootstrap guidance files. Unlike traditional prompt injection, which relies on explicit malicious instructions, guidance injection manipulates the agent's reasoning context by framing harmful actions as routine best practices. These narratives are automatically incorporated into the agent's interpretive framework and influence future task execution without raising suspicion.We construct 26 malicious skills spanning 13 attack categories including credential exfiltration, workspace destruction, privilege escalation, and persistent backdoor installation. We evaluate them using ORE-Bench, a realistic developer workspace benchmark we developed. Across 52 natural user prompts and six state-of-the-art LLM backends, our attacks achieve success rates from 16.0% to 64.2%, with the majority of malicious actions executed autonomously without user confirmation. Furthermore, 94% of our malicious skills evade detection by existing static and LLM-based scanners. Our findings reveal fundamental tensions in the design of autonomous agent ecosystems and underscore the urgent need for defenses based on capability isolation, runtime policy enforcement, and transparent guidance provenance.

Problem

Research questions and friction points this paper is trying to address.

guidance injection

autonomous coding agents

OpenClaw

stealthy attack

adversarial narratives

Innovation

Methods, ideas, or system contributions that make the work stand out.

guidance injection

autonomous coding agents

OpenClaw