🤖 AI Summary
Phishing webpage detection suffers from insufficient robustness, limited brand coverage, and vulnerability to adversarial attacks. This paper proposes a robust multimodal agent for phishing detection, introducing the first cross-modal information retrieval framework that jointly leverages logo visual recognition, HTML structure parsing, and online/offline knowledge bases. Built upon multimodal large language models (MLLMs), our method enables fine-grained brand matching and semantic consistency verification. It supports dynamic knowledge retrieval grounded in both visual and textual cues, significantly enhancing robustness against adversarial perturbations and improving detection accuracy for long-tail brands. Evaluated on three real-world datasets, our approach achieves substantial reductions in false-positive and false-negative rates and marked improvements in overall accuracy, while maintaining high inference efficiency.
📝 Abstract
Phishing attacks are a major threat to online security, exploiting user vulnerabilities to steal sensitive information. Various methods have been developed to counteract phishing, achieving varying levels of accuracy, but each faces notable limitations. In this study, we introduce PhishAgent, a multimodal agent that combines a wide range of tools, integrating both online and offline knowledge bases with Multimodal Large Language Models (MLLMs). This combination leads to broader brand coverage, which enhances brand recognition and recall. Furthermore, we propose a multimodal information retrieval framework designed to retrieve the top-k most relevant items from offline knowledge bases, using the information available on a webpage, including logos and HTML. Our empirical results, based on three real-world datasets, demonstrate that the proposed framework significantly enhances detection accuracy and reduces both false positives and false negatives, while maintaining model efficiency. Additionally, PhishAgent shows strong resilience against various types of adversarial attacks.
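The top-k retrieval step described above can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes that logo and HTML features have already been fused into a single embedding per brand, and the function name `top_k_brands`, the toy knowledge base, and the vectors are all hypothetical. The idea is simply to rank knowledge-base entries by similarity to the suspect webpage's embedding and pass the top-k candidates to the MLLM for verification.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k_brands(query_vec, knowledge_base, k=3):
    """Rank offline knowledge-base entries by similarity to the query
    and return the k best (brand, score) pairs."""
    scored = [(brand, cosine(query_vec, vec))
              for brand, vec in knowledge_base.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy knowledge base: brand -> fused (logo + HTML) embedding (hypothetical values).
kb = {
    "BrandA": [0.9, 0.1, 0.0],
    "BrandB": [0.1, 0.8, 0.3],
    "BrandC": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # fused embedding of the suspect webpage
print(top_k_brands(query, kb, k=2))  # BrandA ranks first
```

In practice the embeddings would come from visual and textual encoders rather than hand-set vectors, and the retrieved candidates would then be checked for brand/domain consistency by the agent.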