Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities

📅 2024-09-24
📈 Citations: 5
Influential: 1
📄 PDF
🤖 AI Summary
Language model (LM) agents have so far performed poorly on cybersecurity tasks such as Capture the Flag (CTF) challenges, largely because they cannot invoke interactive terminal-based security tools (e.g., debuggers, server connection utilities) and respond dynamically to environment feedback. To address this, the authors present EnIGMA, an LM agent whose Interactive Agent Tools enable it, for the first time, to run interactive terminal programs end to end. The authors also analyze data leakage, develop new methods to quantify it, and identify a previously unreported phenomenon they term soliloquizing, in which the model self-generates hallucinated observations without interacting with the environment. Evaluated on 390 CTF challenges spanning four benchmarks, EnIGMA achieves state-of-the-art results on NYU CTF, Intercode-CTF, and CyBench, substantially advancing autonomous vulnerability discovery and exploitation in realistic security settings.

📝 Abstract
Although language model (LM) agents have demonstrated increased performance in multiple domains, including coding and web-browsing, their success in cybersecurity has been limited. We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. We introduce new tools and interfaces to improve the agent's ability to find and exploit security vulnerabilities, focusing on interactive terminal programs. These novel Interactive Agent Tools enable LM agents, for the first time, to run interactive utilities, such as a debugger and a server connection tool, which are essential for solving these challenges. Empirical analysis on 390 CTF challenges across four benchmarks demonstrates that these new tools and interfaces substantially improve our agent's performance, achieving state-of-the-art results on NYU CTF, Intercode-CTF, and CyBench. Finally, we analyze data leakage, developing new methods to quantify it and identifying a new phenomenon we term soliloquizing, where the model self-generates hallucinated observations without interacting with the environment. Our code and development dataset are available at https://github.com/SWE-agent/SWE-agent/tree/v0.7 and https://github.com/NYU-LLM-CTF/NYU_CTF_Bench/tree/main/development respectively.
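The Interactive Agent Tools described in the abstract let the agent drive terminal programs that expect back-and-forth input, such as a debugger or a server connection. As a rough illustration of that idea only (not the paper's implementation; the class and method names here are hypothetical), a minimal Python sketch might spawn a REPL subprocess, feed it a sequence of commands, and capture its output:

```python
import subprocess

class InteractiveTool:
    """Hypothetical sketch of an interactive tool session: spawn a
    terminal program, send it commands, and collect its responses."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,  # REPL banners and prompts often go to stderr
            text=True,
        )

    def run(self, commands):
        # Send every command, then close stdin so the program exits cleanly.
        out, _ = self.proc.communicate("\n".join(commands) + "\n")
        return out

# Example: drive an interactive Python interpreter the way an agent
# might drive gdb or netcat in a CTF challenge.
session = InteractiveTool(["python3", "-u", "-i"])
output = session.run(["x = 6 * 7", "print(x)"])
```

A real agent loop would keep the session open and interleave reads and writes turn by turn (e.g., via a pseudo-terminal) rather than sending all commands at once; this one-shot version only conveys the shape of the interaction.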
Problem

Research questions and friction points this paper is trying to address.

Enhance LM agents in cybersecurity tasks.
Develop tools for interactive vulnerability detection.
Quantify and address model data leakage.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Interactive Agent Tools
Autonomous CTF Solving
Quantify Data Leakage
Talor Abramovich
Tel-Aviv University
Meet Udeshi
New York University
Minghao Shao
New York University
K. Lieret
Princeton Language and Intelligence, Princeton University
Haoran Xi
New York University
Kimberly Milner
New York University
Sofija Jancheska
New York University
John Yang
Stanford University
Carlos E. Jimenez
PhD Student, Princeton University
F. Khorrami
New York University
P. Krishnamurthy
New York University
Brendan Dolan-Gavitt
New York University
Muhammad Shafique
Professor, ECE, New York University (AD-UAE, Tandon-USA), Director eBRAIN Lab
Karthik Narasimhan
Associate Professor, Princeton University
Ramesh Karri
New York University
Ofir Press
Princeton University