🤖 AI Summary
Existing red-teaming approaches struggle to uncover security vulnerabilities that surface only during multi-step tool invocation by large language model agents, such as within the Model Context Protocol (MCP) ecosystem. To address this limitation, this work proposes T-MAP, a novel method that, for the first time, integrates execution-trajectory information into adversarial prompt generation. Using trajectory-aware evolutionary search, T-MAP automatically constructs attack prompts that bypass safety guardrails and realize harmful objectives through actual tool calls. The approach establishes a trajectory-modeling and automated red-teaming framework tailored to tool-use settings, significantly improving the attack realization rate (ARR) across diverse MCP environments. Empirical evaluations demonstrate its effectiveness against state-of-the-art models, including GPT-5.2, Gemini-3-Pro, Qwen3.5, and GLM-5.
📝 Abstract
While prior red-teaming efforts have focused on eliciting harmful text outputs from large language models (LLMs), such approaches fail to capture agent-specific vulnerabilities that emerge through multi-step tool execution, particularly in rapidly growing ecosystems such as the Model Context Protocol (MCP). To address this gap, we propose a trajectory-aware evolutionary search method, T-MAP, which leverages execution trajectories to guide the discovery of adversarial prompts. Our approach enables the automatic generation of attacks that not only bypass safety guardrails but also reliably realize harmful objectives through actual tool interactions. Empirical evaluations across diverse MCP environments demonstrate that T-MAP substantially outperforms baselines in attack realization rate (ARR) and remains effective against frontier models, including GPT-5.2, Gemini-3-Pro, Qwen3.5, and GLM-5, thereby revealing previously underexplored vulnerabilities in autonomous LLM agents.
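The trajectory-aware evolutionary search described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `run_agent`, `trajectory_fitness`, `mutate`, and the keyword-triggered tool simulation are hypothetical stand-ins for a real MCP agent and for T-MAP's actual fitness signal. The key idea it demonstrates is that candidate prompts are scored by the *tool-call trajectory* they elicit, not by the text output alone.

```python
import random

def run_agent(prompt):
    # Hypothetical stand-in for executing an MCP agent on a prompt:
    # returns the tool-call trajectory the prompt elicits. Here we
    # simulate it by treating certain keywords as triggering tools.
    tools = {"read": "fs.read", "send": "net.send", "exec": "shell.exec"}
    return [tool for key, tool in tools.items() if key in prompt]

def trajectory_fitness(trajectory, target=("fs.read", "net.send")):
    # Trajectory-aware score: fraction of a target (harmful) tool
    # sequence actually realized in the agent's execution trace.
    return sum(t in trajectory for t in target) / len(target)

def mutate(prompt, vocab=("read", "send", "exec", "please", "file")):
    # Toy mutation operator: append a random token to the prompt.
    return prompt + " " + random.choice(vocab)

def evolve(seed="summarize this", generations=30, pop_size=8, rng_seed=0):
    # Evolutionary loop: select prompts whose elicited trajectories
    # score highest, then mutate the survivors to form the next pool.
    random.seed(rng_seed)
    population = [seed] * pop_size
    for _ in range(generations):
        ranked = sorted(population,
                        key=lambda p: trajectory_fitness(run_agent(p)),
                        reverse=True)
        parents = ranked[: pop_size // 2]
        children = [mutate(random.choice(parents))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    best = max(population, key=lambda p: trajectory_fitness(run_agent(p)))
    return best, trajectory_fitness(run_agent(best))
```

In this sketch the fitness function rewards prompts whose execution traces cover more of the target tool sequence, so the search is guided by agent behavior rather than by whether the model's reply merely *looks* harmful.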