🤖 AI Summary
Language model (LM) agents have shown strong performance in domains such as coding and web browsing, but their success on cybersecurity tasks, particularly Capture the Flag (CTF) challenges, has been limited, in part because they cannot invoke and interact with the terminal-based security tools (e.g., debuggers, network utilities) that these challenges require, or respond dynamically to environment feedback. EnIGMA is an LM agent for autonomously solving CTF challenges. It introduces novel Interactive Agent Tools that enable LM agents, for the first time, to run interactive terminal utilities such as a debugger and a server connection tool end-to-end. The work also analyzes data leakage, develops new methods to quantify it, and identifies a previously unreported phenomenon termed soliloquizing, in which the model self-generates hallucinated observations without interacting with the environment. Evaluated on 390 CTF challenges across four benchmarks, EnIGMA achieves state-of-the-art results on NYU CTF, Intercode-CTF, and CyBench, advancing autonomous vulnerability discovery and exploitation in realistic security settings.
📝 Abstract
Although language model (LM) agents have demonstrated increased performance in multiple domains, including coding and web browsing, their success in cybersecurity has been limited. We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. We introduce new tools and interfaces to improve the agent's ability to find and exploit security vulnerabilities, focusing on interactive terminal programs. These novel Interactive Agent Tools enable LM agents, for the first time, to run interactive utilities, such as a debugger and a server connection tool, which are essential for solving these challenges. Empirical analysis on 390 CTF challenges across four benchmarks demonstrates that these new tools and interfaces substantially improve our agent's performance, achieving state-of-the-art results on NYU CTF, Intercode-CTF, and CyBench. Finally, we analyze data leakage, developing new methods to quantify it and identifying a new phenomenon we term soliloquizing, where the model self-generates hallucinated observations without interacting with the environment. Our code and development dataset are available at https://github.com/SWE-agent/SWE-agent/tree/v0.7 and https://github.com/NYU-LLM-CTF/NYU_CTF_Bench/tree/main/development respectively.
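The abstract does not specify how the Interactive Agent Tools work internally. As a rough illustration of the underlying idea, keeping a terminal program alive across agent turns and exchanging input and output with it, here is a minimal Python sketch; the class name, the choice of `sh` as the interactive program, and the sentinel trick for detecting the end of a command's output are our own illustrative assumptions, not EnIGMA's implementation.

```python
import subprocess

class InteractiveSession:
    """Hypothetical sketch: keep one interactive terminal program running
    and exchange commands/output with it across multiple agent turns."""

    SENTINEL = "__CMD_DONE__"  # marker to detect end of a command's output

    def __init__(self, argv):
        # Spawn the interactive program with pipes for two-way text I/O.
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            bufsize=1,  # line-buffered
        )

    def run(self, command: str) -> str:
        # Send one command, followed by an echo of the sentinel so we know
        # when the command's output ends. (The echo trick assumes a
        # shell-like tool; a debugger would need a different end marker.)
        self.proc.stdin.write(command + "\n")
        self.proc.stdin.write(f"echo {self.SENTINEL}\n")
        self.proc.stdin.flush()
        lines = []
        while True:
            line = self.proc.stdout.readline()
            if not line or line.strip() == self.SENTINEL:
                break
            lines.append(line)
        return "".join(lines)

    def close(self):
        self.proc.stdin.close()
        self.proc.wait()

# Example: a shell standing in for an interactive security tool.
session = InteractiveSession(["sh"])
out = session.run("echo hello from the tool")
session.close()
```

The key design point this sketch captures is statefulness: unlike one-shot command execution, the spawned process persists between `run` calls, so an agent could, for instance, set a breakpoint in one turn and inspect memory in the next.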