Constructing and Benchmarking: a Labeled Email Dataset for Text-Based Phishing and Spam Detection Framework

📅 2025-11-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the scarcity of fine-grained annotated datasets for phishing and spam email detection, this paper introduces the first multidimensional email dataset jointly labeled along four axes: email category (phishing/spam/legitimate), generation source (human/LLM), emotional appeal (e.g., urgency, authority), and attack intent (e.g., credential theft, financial fraud). We propose a semantic-preserving LLM rewriting test to rigorously evaluate model robustness against adversarial paraphrasing and establish a gold-standard annotation protocol via expert curation. Comprehensive benchmarking assesses multiple LLMs on both original and rewritten emails. Experiments show strong phishing detection performance; however, distinguishing spam from legitimate emails remains challenging. All code, data, and annotation guidelines are publicly released to support AI-driven email security research.

📝 Abstract
Phishing and spam emails remain a major cybersecurity threat, with attackers increasingly leveraging Large Language Models (LLMs) to craft highly deceptive content. This study presents a comprehensive email dataset containing phishing, spam, and legitimate messages, explicitly distinguishing between human- and LLM-generated content. Each email is annotated with its category, emotional appeal (e.g., urgency, fear, authority), and underlying motivation (e.g., link-following, credential theft, financial fraud). We benchmark multiple LLMs on their ability to identify these emotional and motivational cues and select the most reliable model to annotate the full dataset. To evaluate classification robustness, emails were also rephrased using several LLMs while preserving meaning and intent. A state-of-the-art LLM was then assessed on its performance across both original and rephrased emails using expert-labeled ground truth. The results highlight strong phishing detection capabilities but reveal persistent challenges in distinguishing spam from legitimate emails. Our dataset and evaluation framework contribute to improving AI-assisted email security systems. To support open science, all code, templates, and resources are available on our project site.
Problem

Research questions and friction points this paper is trying to address.

Develop a labeled dataset for phishing and spam detection
Benchmark LLMs' ability to identify emotional and motivational cues
Evaluate classification robustness against rephrased email content
Innovation

Methods, ideas, or system contributions that make the work stand out.

Created a labeled dataset distinguishing human- and LLM-generated emails
Used LLMs to annotate emotional and motivational email cues
Evaluated detection robustness using LLM-rephrased email variants
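The robustness check described above can be sketched as follows. This is an illustrative sketch, not the authors' released code: `classify` is a hypothetical stand-in for any detector (e.g., an LLM prompted to label an email as phishing, spam, or legitimate), and `paraphrase_agreement` measures how often the predicted label survives a semantic-preserving rewrite.

```python
def classify(email: str) -> str:
    # Toy keyword heuristic standing in for a real LLM-based classifier.
    text = email.lower()
    if "password" in text or "verify your account" in text:
        return "phishing"
    if "limited offer" in text or "unsubscribe" in text:
        return "spam"
    return "legitimate"

def paraphrase_agreement(pairs):
    """Fraction of (original, rewritten) email pairs whose predicted
    label is unchanged by the meaning-preserving LLM rewrite."""
    if not pairs:
        return 0.0
    stable = sum(classify(orig) == classify(rew) for orig, rew in pairs)
    return stable / len(pairs)

# Each pair: (original email text, LLM-rephrased variant).
pairs = [
    ("Please verify your account now.", "Kindly confirm your password today."),
    ("Limited offer, buy now!", "Act fast on this exclusive deal."),
]
print(paraphrase_agreement(pairs))  # 0.5: the spam label did not survive
```

A low agreement rate signals that a detector relies on surface wording rather than the underlying intent, which is exactly what the rewriting test is designed to expose.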
Rebeka Toth
University of Oslo, Oslo, Norway
Tamas Bisztray
Postdoctoral Researcher at University of Oslo
Cybersecurity · AI Safety · Privacy and Data Protection · Identity Management · Biometrics
Richard Dubniczky
Eötvös Loránd University, Budapest, Hungary