🤖 AI Summary
Traditional string-based URL features are easily evaded by obfuscation, which makes phishing website detectors built on them fragile. To address this, the paper proposes a heterogeneous graph modeling approach that jointly encodes URL structural patterns and network-layer entities (e.g., IP addresses, authoritative DNS servers). We introduce a dynamic edge potential mechanism and a loopy belief propagation (LBP) algorithm with an enhanced convergence strategy, enabling stable and interpretable probabilistic inference over complex heterogeneous graphs. Furthermore, we integrate graph neural networks with network topology-aware feature extraction to support end-to-end phishing URL classification. Evaluated on real-world datasets, our method achieves an F1 score of up to 98.77%, significantly outperforming state-of-the-art approaches. The framework demonstrates high reproducibility and practical deployability in operational security systems.
📝 Abstract
The proliferation of mobile devices and online interactions has been accompanied by a growing range of cyberattacks, among which phishing attacks and malicious Uniform Resource Locators (URLs) pose significant risks to user security. Traditional phishing URL detection methods rely primarily on URL string-based features, which attackers often manipulate to evade detection. To address these limitations, we propose a novel graph-based machine learning model for phishing URL detection that integrates both URL structure and network-level features such as IP addresses and authoritative name servers. Our approach leverages Loopy Belief Propagation (LBP) with an enhanced convergence strategy to enable effective message passing and stable classification in the presence of complex graph structures. Additionally, we introduce a refined edge potential mechanism that adapts dynamically to entity similarity and label relationships to further improve classification accuracy. Comprehensive experiments on real-world datasets demonstrate the model's effectiveness, with an F1 score of up to 98.77%. This robust and reproducible method advances phishing detection capabilities, offering enhanced reliability and valuable insights in the field of cybersecurity.
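To make the message-passing idea concrete, here is a minimal sketch of loopy belief propagation on a small graph of URL, IP address, and name server nodes. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the convergence strategy or the form of the edge potential, so this sketch substitutes message damping (a common stabilization technique) and a hypothetical similarity-scaled potential. All names (`edge_potential`, `loopy_bp`, the node labels, the similarity values) are invented for the example.

```python
from collections import defaultdict

STATES = 2  # index 0 = benign, index 1 = phishing


def edge_potential(similarity, eps=0.05):
    """Hypothetical 2x2 potential (assumption, not the paper's formula).

    Higher similarity between two entities means a stronger preference
    for both endpoints to share the same label.
    """
    same = 0.5 + (0.5 - eps) * similarity  # ranges over [0.5, 1 - eps]
    diff = 1.0 - same
    return [[same, diff], [diff, same]]


def normalize(vec):
    total = sum(vec)
    return [x / total for x in vec]


def loopy_bp(priors, edges, damping=0.5, max_iters=100, tol=1e-6):
    """Damped sum-product LBP.

    priors: {node: [p_benign, p_phishing]}
    edges:  {(u, v): similarity in [0, 1]}
    """
    neighbors = defaultdict(list)
    psi = {}
    for (u, v), sim in edges.items():
        pot = edge_potential(sim)
        psi[(u, v)] = pot
        psi[(v, u)] = pot  # potential is symmetric, so no transpose needed
        neighbors[u].append(v)
        neighbors[v].append(u)

    # Start from uniform messages in both directions on every edge.
    msgs = {key: [0.5, 0.5] for key in psi}

    for _ in range(max_iters):
        new_msgs = {}
        delta = 0.0
        for (u, v), old in msgs.items():
            # Combine u's prior with messages from all neighbors except v.
            pre = list(priors[u])
            for w in neighbors[u]:
                if w != v:
                    pre = [pre[k] * msgs[(w, u)][k] for k in range(STATES)]
            out = normalize([
                sum(pre[a] * psi[(u, v)][a][b] for a in range(STATES))
                for b in range(STATES)
            ])
            # Damping: blend with the previous message to reduce oscillation.
            out = [damping * old[k] + (1 - damping) * out[k]
                   for k in range(STATES)]
            new_msgs[(u, v)] = out
            delta = max(delta, max(abs(out[k] - old[k]) for k in range(STATES)))
        msgs = new_msgs
        if delta < tol:
            break

    beliefs = {}
    for node, prior in priors.items():
        b = list(prior)
        for w in neighbors[node]:
            b = [b[k] * msgs[(w, node)][k] for k in range(STATES)]
        beliefs[node] = normalize(b)
    return beliefs


# Toy graph: labeled URLs carry confident priors, everything else is uniform.
priors = {
    "url:known-phish": [0.01, 0.99],
    "url:known-benign": [0.99, 0.01],
    "url:unknown": [0.5, 0.5],
    "ip:A": [0.5, 0.5],
    "ns:A": [0.5, 0.5],
    "ip:B": [0.5, 0.5],
}
edges = {
    ("url:known-phish", "ip:A"): 0.9,
    ("url:known-phish", "ns:A"): 0.9,
    ("url:unknown", "ip:A"): 0.9,
    ("url:unknown", "ns:A"): 0.9,
    ("url:known-benign", "ip:B"): 0.9,
}
beliefs = loopy_bp(priors, edges)
```

In this toy graph the unlabeled URL shares both its IP address and its name server with a known phishing URL, forming a cycle, so its belief shifts toward the phishing state even though its own prior is uniform. An edge potential that also depends on label relationships, as the abstract describes, would replace `edge_potential` here.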