🤖 AI Summary
Existing black-box scanners suffer from low test coverage and limited vulnerability detection because they cannot understand the semantics of user interaction, particularly for deep-functionality pages that require multi-step navigation. This work proposes the first large language model (LLM)-driven, semantic-aware black-box scanning framework. It models user intent via LLM-based intent inference and integrates dynamic page navigation with semantics-guided crawling to enable intent-oriented path exploration, breaking away from conventional structure- or randomness-based crawling paradigms. Experimental evaluation across 12 open-source web applications shows that the approach achieves, on average, about twice the page coverage of state-of-the-art tools; that over 90% of generated requests target core functional modules; and that it uncovers multiple previously undetected high-severity vulnerabilities.
📝 Abstract
Black-box scanners have played a significant role in detecting vulnerabilities in web applications. A key focus of current black-box scanning is increasing test coverage (i.e., accessing more web pages). However, because many web applications are user-oriented, some deep pages can only be accessed through complex user interactions, which existing black-box scanners struggle to perform. To fill this gap, our key insight is that web pages contain a wealth of semantic information that can help in understanding potential user intention. Based on this insight, we propose Hoyen, a black-box scanner that uses a large language model (LLM) to predict user intention and to guide the expansion of the scanning scope. Hoyen has been rigorously evaluated on 12 popular open-source web applications and compared with 6 representative tools. The results demonstrate that Hoyen explores web applications comprehensively, expanding the attack surface while achieving about 2x the coverage of other scanners on average, with high request accuracy. Furthermore, Hoyen directed over 90% of its requests towards the core functionality of the application and detected more vulnerabilities than other scanners, including unique vulnerabilities in well-known web applications. Our data/code is available at https://hoyen.tjunsl.com/
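
The idea of letting predicted user intention guide the scan can be illustrated with a minimal sketch. This is not Hoyen's implementation: the site map `SITE`, the function `keyword_intent_score` (a keyword heuristic standing in for an LLM call), and the `budget` parameter are all hypothetical names introduced for illustration. The sketch only shows how a per-action intent score could reorder a crawl frontier so that actions likely to lead to core functionality are explored first.

```python
import heapq
from typing import Callable, Dict, List, Tuple

# Hypothetical site map (an assumption): page -> list of (action label, target page).
SITE: Dict[str, List[Tuple[str, str]]] = {
    "/": [("About us", "/about"), ("Log in", "/login"), ("Privacy policy", "/privacy")],
    "/login": [("Create new post", "/posts/new"), ("Help", "/help")],
    "/posts/new": [("Upload attachment", "/posts/upload")],
}


def keyword_intent_score(label: str) -> float:
    """Stand-in for an LLM query (an assumption, not the paper's method).

    Returns a higher score when the action label looks like a step a real
    user would take to reach core functionality.
    """
    core_terms = ("log in", "create", "upload", "edit", "submit")
    return 1.0 if any(term in label.lower() for term in core_terms) else 0.1


def intent_guided_crawl(
    start: str, score: Callable[[str], float], budget: int
) -> List[str]:
    """Visit up to `budget` pages, always expanding the highest-scored action next."""
    visited: List[str] = []
    seen = {start}
    # Heap entries: (negated score, insertion counter as tie-breaker, page).
    frontier: List[Tuple[float, int, str]] = [(0.0, 0, start)]
    counter = 1
    while frontier and len(visited) < budget:
        _, _, page = heapq.heappop(frontier)
        visited.append(page)
        for label, target in SITE.get(page, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score(label), counter, target))
                counter += 1
    return visited


if __name__ == "__main__":
    print(intent_guided_crawl("/", keyword_intent_score, budget=4))
```

With a budget of four pages, this toy crawl follows the login, post-creation, and upload path before touching the informational pages, whereas a breadth-first crawl with the same budget would spend most of its requests on the top-level links. In a real scanner the scoring function would be replaced by a model query over the page's semantic content.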