🤖 AI Summary
Existing allow lists rely heavily on domain popularity, so many infrequently visited yet trustworthy long-tail domains are overlooked, which limits both coverage and regional adaptability. To address this, we propose DomainHarvester, a bottom-up method driven by the web's hyperlink structure: it crawls outward from seed URLs to collect domain names and scores their trustworthiness with a Transformer-based model, identifying credible low-frequency domains that popularity-based approaches ignore. The system produces allow lists at two granularities, global and locale-specific. In evaluation, the generated lists overlap with six existing top lists by only 4% globally and 0.1% locally, while significantly reducing the risk of including malicious domains. The approach is particularly beneficial in non-English-speaking regions.
📝 Abstract
In cybersecurity, allow lists play a crucial role in distinguishing safe websites from potential threats. Conventional methods for compiling allow lists focus heavily on website popularity and therefore often overlook infrequently visited legitimate domains. This paper introduces DomainHarvester, a system for generating allow lists that include trustworthy yet infrequently visited domains. DomainHarvester adopts a bottom-up methodology that leverages the web's hyperlink structure: it starts from seed URLs to gather domain names, then assesses their trustworthiness using machine learning with a Transformer-based approach. Using DomainHarvester, we built two distinct allow lists: one with a global focus and another emphasizing local relevance. Compared to six existing top lists, DomainHarvester's allow lists show minimal overlap, 4% globally and 0.1% locally, while significantly reducing the risk of including malicious domains, thereby enhancing security. This research draws attention to the overlooked problem of trustworthy yet underrepresented domains and introduces DomainHarvester, a system that goes beyond traditional popularity-based metrics. Our methodology enhances the inclusivity and precision of allow lists, offering significant advantages to users and businesses worldwide, especially in non-English-speaking regions.
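To make the bottom-up idea concrete, here is a minimal Python sketch of two steps mentioned in the abstract: collecting candidate domains from the hyperlinks on a seed page, and measuring overlap with an existing top list. It is an illustration under stated assumptions, not DomainHarvester's implementation. The names `LinkCollector`, `harvest_domains`, and `overlap_ratio` are hypothetical, the page HTML is passed in as a string rather than fetched, full hostnames stand in for registrable domains, and the overlap definition (share of candidates also present in the top list) is an assumption, since the abstract does not define it. The Transformer-based trust scoring step is omitted entirely.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def harvest_domains(seed_url, html):
    """Return the set of external hostnames linked from one seed page."""
    collector = LinkCollector()
    collector.feed(html)
    seed_host = urlparse(seed_url).hostname
    domains = set()
    for href in collector.hrefs:
        # Resolve relative links against the seed URL.
        parsed = urlparse(urljoin(seed_url, href))
        # Keep only web links that point away from the seed host.
        if parsed.scheme in ("http", "https") and parsed.hostname:
            if parsed.hostname != seed_host:
                domains.add(parsed.hostname)
    return domains


def overlap_ratio(candidates, top_list):
    """Assumed definition: fraction of candidates also in an existing top list."""
    if not candidates:
        return 0.0
    return len(candidates & set(top_list)) / len(candidates)
```

In a fuller pipeline, each harvested domain would be passed to a trust classifier before being added to the allow list, and newly accepted domains could serve as further seeds.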