🤖 AI Summary
To address the growing threat of large language model (LLM) misuse in sophisticated phishing attacks, this paper introduces ChatPhishDetector—the first end-to-end phishing website detection framework leveraging general-purpose multimodal LLMs (e.g., GPT-4V). It employs web crawling to acquire full-site content, followed by multimodal parsing and semantic prompt engineering to enable zero-shot, training-free, multilingual phishing identification. Crucially, it supports deep semantic analysis of both brand impersonation and social-engineering lures. Its core innovation lies in bypassing conventional feature engineering and domain-specific model training, instead directly applying off-the-shelf LLMs for holistic, site-level phishing detection. Evaluated on a curated benchmark dataset, ChatPhishDetector achieves 98.7% precision and 99.6% recall—substantially outperforming state-of-the-art rule-based, machine learning–based, and alternative LLM–based detectors.
📝 Abstract
The emergence of Large Language Models (LLMs), including ChatGPT, is having a significant impact on a wide range of fields. While LLMs have been extensively researched for tasks such as code generation and text synthesis, their application to detecting malicious web content, particularly phishing sites, has been largely unexplored. To combat the rising tide of cyber attacks enabled by the misuse of LLMs, it is important to automate detection by leveraging the advanced capabilities of LLMs themselves. In this paper, we propose a novel system called ChatPhishDetector that utilizes LLMs to detect phishing sites. Our system uses a web crawler to gather information from websites, generates prompts for LLMs based on the crawled data, and then retrieves the detection results from the responses generated by the LLMs. The system detects multilingual phishing sites with high accuracy by identifying impersonated brands and social engineering techniques in the context of the entire website, without the need to train machine learning models. To evaluate the performance of our system, we conducted experiments on our own dataset and compared its results with baseline systems and several LLMs. The experimental results using GPT-4V demonstrated outstanding performance, with a precision of 98.7% and a recall of 99.6%, outperforming the detection results of other LLMs and existing systems. These findings highlight the potential of LLMs for protecting users from online fraudulent activities and have important implications for enhancing cybersecurity measures.
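The pipeline the abstract describes (crawl a site, build an LLM prompt from the crawled content, retrieve the model's verdict from its response) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: every function name, prompt wording, and data field here is an assumption, and the LLM call is stubbed out where a real system would invoke GPT-4V or a similar multimodal model.

```python
# Illustrative sketch of a ChatPhishDetector-style pipeline:
# (1) crawl a page, (2) build a zero-shot prompt from the crawled content,
# (3) ask an LLM whether the site is phishing, (4) parse the verdict.
# All names and prompt text are hypothetical, not taken from the paper.
from dataclasses import dataclass
from typing import Callable


@dataclass
class CrawledSite:
    url: str
    html: str
    screen_text: str  # text visible in a screenshot, for multimodal models


def build_prompt(site: CrawledSite) -> str:
    """Compose a zero-shot prompt asking the LLM to judge the whole site."""
    return (
        "You are a phishing-detection assistant.\n"
        f"URL: {site.url}\n"
        f"HTML (truncated): {site.html[:2000]}\n"
        f"Visible text: {site.screen_text}\n"
        "Does this site impersonate a brand or use social-engineering "
        "lures? Reply 'phishing: yes' or 'phishing: no' and name any brand."
    )


def parse_verdict(response: str) -> bool:
    """Extract a boolean phishing verdict from the model's free-form reply."""
    return "phishing: yes" in response.lower()


def detect(site: CrawledSite, ask_llm: Callable[[str], str]) -> bool:
    """Run the end-to-end pipeline; `ask_llm` is any prompt -> text callable."""
    return parse_verdict(ask_llm(build_prompt(site)))


# Stub LLM for demonstration only; a deployed system would call a real
# multimodal model here and also pass the screenshot image.
def fake_llm(prompt: str) -> str:
    if "password" in prompt.lower():
        return "phishing: yes (impersonates ExampleBank)"
    return "phishing: no"


site = CrawledSite(
    url="https://examp1e-bank.test/login",
    html="<form>Log in to ExampleBank</form>",
    screen_text="Enter your password",
)
print(detect(site, fake_llm))  # → True
```

The design point worth noting is that the detector is training-free: all site-specific reasoning (brand impersonation, social-engineering cues) is delegated to the LLM via the prompt, so swapping in a different model only means changing the `ask_llm` callable.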