HackWorld: Evaluating Computer-Use Agents on Exploiting Web Application Vulnerabilities

📅 2025-10-14
🤖 AI Summary
Problem: Prior work has not systematically evaluated whether computer-use agents (CUAs) can discover and exploit vulnerabilities in real-world web applications through their graphical interfaces. Method: This paper introduces HackWorld, described as the first framework for evaluating CUAs against realistic vulnerable web applications through visual interaction, covering 36 real-world applications built on 11 frameworks. It places CUAs in a CTF-style, end-to-end attack setting that requires dynamic content handling, multi-step interaction planning, and the use of cybersecurity tools. Contribution/Results: State-of-the-art CUAs achieve exploitation success rates below 12%, revealing weaknesses in security awareness, multi-step attack planning, and tool use. The work provides a benchmark and points to opportunities for building more security-aware agents.

📝 Abstract
Web applications are prime targets for cyberattacks as gateways to critical services and sensitive data. Traditional penetration testing is costly and expertise-intensive, making it difficult to scale with the growing web ecosystem. While language model agents show promise in cybersecurity, modern web applications demand visual understanding, dynamic content handling, and multi-step interactions that only computer-use agents (CUAs) can perform. Yet, their ability to discover and exploit vulnerabilities through graphical interfaces remains largely unexplored. We present HackWorld, the first framework for systematically evaluating CUAs' capabilities to exploit web application vulnerabilities via visual interaction. Unlike sanitized benchmarks, HackWorld includes 36 real-world applications across 11 frameworks and 7 languages, featuring realistic flaws such as injection vulnerabilities, authentication bypasses, and unsafe input handling. Using a Capture-the-Flag (CTF) setup, it tests CUAs' capacity to identify and exploit these weaknesses while navigating complex web interfaces. Evaluation of state-of-the-art CUAs reveals concerning trends: exploitation rates below 12% and low cybersecurity awareness. CUAs often fail at multi-step attack planning and misuse security tools. These results expose the current limitations of CUAs in web security contexts and highlight opportunities for developing more security-aware agents capable of effective vulnerability detection and exploitation.

Problem

Research questions and friction points this paper is trying to address.

Evaluating computer-use agents' ability to exploit web application vulnerabilities through visual interaction
Assessing agents' capacity to handle dynamic content and multi-step attack planning
Measuring exploitation rates and cybersecurity awareness in complex web interfaces

Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluating computer-use agents via visual interaction
Testing vulnerability exploitation with real-world applications
Using a Capture-the-Flag (CTF) setup for security assessment
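
As a rough illustration of how a CTF-style exploitation rate could be scored, the sketch below counts a challenge as solved when the agent submits the expected flag and reports the fraction of challenges solved. It is not HackWorld's actual harness: the `Attempt` record, the flag strings, and the function names are all hypothetical and chosen only for this example.

```python
import hmac
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Attempt:
    """One agent run against one challenge (hypothetical record format)."""
    challenge_id: str
    submitted_flag: Optional[str]  # None if the agent never submitted a flag


def is_solved(attempt: Attempt, expected_flags: Dict[str, str]) -> bool:
    """Return True if the submitted flag matches the challenge's expected flag."""
    expected = expected_flags.get(attempt.challenge_id)
    if expected is None or attempt.submitted_flag is None:
        return False
    # Compare as bytes in constant time; surrounding whitespace is ignored.
    submitted = attempt.submitted_flag.strip().encode("utf-8")
    return hmac.compare_digest(submitted, expected.encode("utf-8"))


def exploitation_rate(attempts: Iterable[Attempt],
                      expected_flags: Dict[str, str]) -> float:
    """Fraction of challenges solved by at least one attempt."""
    if not expected_flags:
        return 0.0
    solved = {a.challenge_id for a in attempts if is_solved(a, expected_flags)}
    return len(solved) / len(expected_flags)


# Made-up flags and attempts, for illustration only.
expected = {"c1": "FLAG{a}", "c2": "FLAG{b}", "c3": "FLAG{c}", "c4": "FLAG{d}"}
attempts = [
    Attempt("c1", "FLAG{a}"),
    Attempt("c2", "wrong"),
    Attempt("c3", None),
    Attempt("c1", "FLAG{a} "),  # repeat solve of c1, counted once
]
rate = exploitation_rate(attempts, expected)
```

Counting each challenge at most once keeps repeated attempts on an easy target from inflating the rate; a per-attempt success rate would be a different, equally plausible metric.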