PhreshPhish: A Real-World, High-Quality, Large-Scale Phishing Website Dataset and Benchmark

📅 2025-07-14
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing phishing website detection research suffers from small-scale, low-quality datasets plagued by label noise, data leakage, and base-rate distortion, leading to overly optimistic model evaluations. To address these issues, the authors introduce PhreshPhish, a large-scale, high-quality phishing website dataset constructed with rigorous data cleaning to minimize label errors and invalid data points. They further propose a comprehensive suite of benchmark datasets designed for realistic evaluation: minimizing leakage, increasing task difficulty, enhancing dataset diversity, and adjusting base rates toward those seen in the real world (~1%). Multiple solution approaches are trained and evaluated to provide baseline performance, and the datasets and benchmarks are publicly released on Hugging Face, enabling standardized, reproducible comparison in phishing detection research.

📝 Abstract
Phishing remains a pervasive and growing threat, inflicting heavy economic and reputational damage. While machine learning has been effective in real-time detection of phishing attacks, progress is hindered by the lack of large, high-quality datasets and benchmarks. In addition to poor quality stemming from challenges in data collection, existing datasets suffer from leakage and unrealistic base rates, leading to overly optimistic performance results. In this paper, we introduce PhreshPhish, a large-scale, high-quality dataset of phishing websites that addresses these limitations. Compared to existing public datasets, PhreshPhish is substantially larger and provides significantly higher quality, as measured by the estimated rate of invalid or mislabeled data points. Additionally, we propose a comprehensive suite of benchmark datasets specifically designed for realistic model evaluation by minimizing leakage, increasing task difficulty, enhancing dataset diversity, and adjusting base rates to those more likely to be seen in the real world. We train and evaluate multiple solution approaches to provide baseline performance on the benchmark sets. We believe the availability of this dataset and benchmarks will enable realistic, standardized model comparison and foster further advances in phishing detection. The datasets and benchmarks are available on Hugging Face (https://huggingface.co/datasets/phreshphish/phreshphish).
Problem

Research questions and friction points this paper is trying to address.

Lack of large, high-quality phishing datasets hinders ML progress
Existing datasets suffer from leakage and unrealistic base rates
Need realistic benchmarks for standardized phishing detection comparison
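The base-rate problem listed above can be made concrete with Bayes' rule: a detector that looks near-perfect on a balanced test set can still produce mostly false alarms when phishing pages are rare. A minimal sketch (the true-positive and false-positive rates below are illustrative, not results from the paper):

```python
def precision_at_base_rate(tpr: float, fpr: float, base_rate: float) -> float:
    """Bayes-rule precision: P(phish | flagged) at a given phishing base rate."""
    tp = tpr * base_rate          # expected fraction of true positives
    fp = fpr * (1.0 - base_rate)  # expected fraction of false positives
    return tp / (tp + fp)

# Hypothetical detector: 95% true-positive rate, 1% false-positive rate.
balanced = precision_at_base_rate(0.95, 0.01, 0.50)   # 50/50 benchmark split
realistic = precision_at_base_rate(0.95, 0.01, 0.01)  # ~1% real-world base rate
print(f"balanced: {balanced:.2f}, realistic: {realistic:.2f}")  # ≈ 0.99 vs ≈ 0.49
```

At a 50/50 split the detector's precision is about 0.99, but at a 1% base rate roughly half of its alerts are false positives, which is why benchmarks that calibrate base rates to real-world values give a very different picture of model quality.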
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale high-quality phishing website dataset
Comprehensive benchmark minimizing data leakage
Realistic base rates and enhanced diversity
Thomas Dalton, OpenText
Hemanth Gowda, OpenText
Girish Rao, OpenText
Sachin Pargi, OpenText
Alireza Hadj Khodabakhshi, OpenText
Joseph Rombs, OpenText
Stephan Jou, OpenText
Manish Marwah, HP Labs
data science · applied machine learning · cybersecurity · computational sustainability