🤖 AI Summary
This paper addresses the challenge of rapidly identifying untrustworthy domains in social media and search engines. We introduce the concept of “dredge words”—search queries disproportionately monopolized by low-credibility domains—and propose the first joint credibility modeling framework integrating web page graphs with social propagation graphs. Methodologically, we design a dredge word detection algorithm that jointly models search ranking bias and social retweeting paths, and we develop a multi-source heterogeneous graph neural network (combining the webgraph with social mention/retweet graphs) to jointly learn representations of search intent and domain credibility. Contributions include: (1) releasing the first high-quality benchmark dataset of 12,000 annotated dredge words; (2) achieving state-of-the-art performance on website credibility classification; (3) significantly improving top-k identification of untrustworthy domains; and (4) uncovering, for the first time, strong empirical associations between dredge words and both social platforms and e-commerce ecosystems.
📝 Abstract
Proactive content moderation requires platforms to rapidly and continuously evaluate the credibility of websites. Leveraging the direct and indirect paths users follow to unreliable websites, we develop a website credibility classification and discovery system that integrates both webgraph and large-scale social media contexts. We additionally introduce the concept of dredge words, terms or phrases for which unreliable domains rank highly on search engines, and provide the first exploration of their usage on social media. Our graph neural networks, which combine webgraph and social media contexts, achieve state-of-the-art results in website credibility classification and significantly improve the top-k identification of unreliable domains. Additionally, we release a novel dataset of dredge words, highlighting their strong connections to both social media and online commerce platforms.
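To make the dredge-word definition concrete, here is a minimal toy sketch (our own illustration, not the authors' detection algorithm): given ranked search results per query and a list of known unreliable domains, flag queries whose top results are dominated by those domains. The `top_k` and `threshold` parameters, and all domain names, are hypothetical.

```python
def flag_dredge_words(search_results, unreliable, top_k=5, threshold=0.5):
    """Flag candidate dredge words.

    search_results: dict mapping query -> ranked list of result domains.
    unreliable: set of domains known to be low-credibility.
    Returns dict mapping flagged query -> share of unreliable domains
    among its top_k results.
    """
    flagged = {}
    for query, ranked in search_results.items():
        top = ranked[:top_k]
        if not top:
            continue
        share = sum(d in unreliable for d in top) / len(top)
        if share >= threshold:
            flagged[query] = share
    return flagged


# Hypothetical example: one query monopolized by unreliable domains.
results = {
    "miracle cure x": ["badnews.example", "hoax.example", "clinic.example",
                       "badnews.example", "scam.example"],
    "city weather": ["weather.example", "news.example", "gov.example"],
}
unreliable = {"badnews.example", "hoax.example", "scam.example"}
print(flag_dredge_words(results, unreliable))  # → {'miracle cure x': 0.8}
```

The paper's actual method additionally models social propagation signals; this sketch covers only the search-ranking side of the definition.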