A New Dataset and Methodology for Malicious URL Classification

📅 2024-12-31
🤖 AI Summary
Existing malicious URL detection models suffer from low accuracy and poor real-time performance, and the field lacks high-quality, multi-class benchmark datasets. To address these gaps, this paper introduces DeepURLBench, presented as the first open-source URL dataset with three-class labels (benign, phishing, malicious), and proposes an enhanced URLNet model that incorporates DNS resolution features. The method extends string-level URL representation learning with lightweight DNS-derived features, and it moves beyond binary classification by training on three classes while mitigating class imbalance. Experimental results demonstrate an accuracy of 98.7% and a 4.2-percentage-point improvement in macro-F1 score, with inference latency under 15 ms. DeepURLBench is presented as the most rigorous and largest publicly available multi-class URL benchmark to date. Together, the dataset and the model offer a new evaluation standard and an effective, low-latency solution for real-time web security defense.
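Macro-F1 is reported alongside accuracy because accuracy alone can look strong on an imbalanced three-class problem. The sketch below is an illustrative, standard-library computation of macro-F1 and is not the paper's evaluation code. The label names follow the three classes named in the summary; the function name `macro_f1` and the toy data are assumptions made for illustration.

```python
from collections import Counter

# Class names taken from the summary; everything else here is illustrative.
LABELS = ("benign", "phishing", "malicious")


def macro_f1(y_true, y_pred, labels=LABELS):
    """Return the unweighted mean of per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy imbalanced example: a model that always predicts "benign".
y_true = ["benign"] * 8 + ["phishing", "malicious"]
y_pred = ["benign"] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
```

In this toy case accuracy is 0.8 while macro-F1 is roughly 0.30, because the two minority classes each contribute an F1 of zero to the average.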

📝 Abstract
Malicious URL (Uniform Resource Locator) classification is a pivotal aspect of cybersecurity, offering defense against web-based threats. Despite deep learning's promise in this area, its advancement is hindered by two main challenges: the scarcity of comprehensive, open-source datasets and the limitations of existing models, which either lack real-time capabilities or exhibit suboptimal performance. To address these gaps, we introduce DeepURLBench, a novel multi-class dataset for malicious URL classification that distinguishes between benign, phishing, and malicious URLs. The data has been rigorously cleansed and structured, providing a superior alternative to existing datasets. Notably, the multi-class approach enhances the performance of deep learning models compared with a standard binary classification approach. Additionally, we propose improvements to string-based URL classifiers and apply these enhancements to URLNet. Key among these is the integration of DNS-derived features, which enrich the model's capabilities and lead to notable performance gains while preserving real-time runtime efficiency, achieving an effective balance for cybersecurity applications.
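To make the idea of pairing a string-based classifier with DNS-derived features concrete, the following minimal sketch builds one model input from a URL and a previously fetched DNS record. It is not the paper's implementation: the abstract does not list which DNS features are used, so the record keys (`a_records`, `ttl`, `ns_records`), the alphabet, `MAX_LEN`, and all function names are hypothetical choices for illustration.

```python
import math

# Hypothetical character vocabulary; ids 0 and 1 are reserved.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-._~:/?#[]@!$&'()*+,;=%"
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(ALPHABET)}  # 0 = padding, 1 = unknown
MAX_LEN = 200  # assumed fixed input length, not taken from the paper


def encode_url(url, max_len=MAX_LEN):
    """Map a URL to a fixed-length list of character ids (truncate, then pad)."""
    ids = [CHAR_TO_ID.get(ch, 1) for ch in url.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def dns_features(record):
    """Turn an already-resolved DNS record (a plain dict) into a small numeric vector.

    The keys used here are assumptions; pass None when resolution failed.
    """
    if record is None:
        return [0.0, 0.0, 0.0, 0.0]
    return [
        1.0,                                       # domain resolved
        float(len(record.get("a_records", []))),   # number of A records
        math.log1p(record.get("ttl", 0)),          # log-scaled TTL in seconds
        float(len(record.get("ns_records", []))),  # number of NS records
    ]


def build_input(url, record):
    """Combine both views; a model would embed char_ids and concatenate dns."""
    return {"char_ids": encode_url(url), "dns": dns_features(record)}


example = build_input(
    "http://example.com/login",
    {"a_records": ["93.184.216.34"], "ttl": 3600, "ns_records": ["a.ns", "b.ns"]},
)
```

Because the DNS vector is small and computed from a lookup that can be cached, a design along these lines adds little work per URL, which is consistent with the abstract's emphasis on preserving real-time efficiency.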
Problem

Research questions and friction points this paper is trying to address.

Malicious URL Detection
Deep Learning Models
Data Quality
Innovation

Methods, ideas, or system contributions that make the work stand out.

DeepURLBench
Multi-Classification Optimization
DNS Integration