CrediBench: Building Web-Scale Network Datasets for Information Integrity

📅 2025-09-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to misinformation detection often model webpage content and hyperlink structure in isolation, neglecting their dynamic co-evolution. Method: This paper introduces the first framework to jointly model textual content and hyperlink structure as they evolve over time. The authors build a data processing pipeline over Common Crawl archives that constructs time-sliced temporal web graphs, enabling a longitudinal representation of the Web. By integrating NLP-derived features with graph neural networks, they produce a large-scale, publicly available web graph dataset—a one-month snapshot comprising 45 million nodes and 1 billion edges—the largest of its kind to date. Contribution/Results: The dataset captures both website content updates and evolving citation relationships between sites. Empirical evaluation demonstrates that jointly leveraging content and structural signals significantly improves credibility scoring performance. This work establishes a new paradigm for studying the misinformation ecosystem and provides critical infrastructure for future research.
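The time-sliced pipeline described above can be sketched as follows. This is a minimal illustration of grouping crawl link records into per-month domain-level graph slices; the record format and function names are assumptions for illustration, not the paper's actual implementation:

```python
from collections import defaultdict

def build_temporal_slices(records):
    """Group link observations into per-month adjacency maps.

    `records` is a hypothetical iterable of (month, source_domain,
    target_domain) tuples, where month is a "YYYY-MM" string.
    Returns {month: {source: set(targets)}}.
    """
    slices = defaultdict(lambda: defaultdict(set))
    for month, src, dst in records:
        if src != dst:  # drop trivial self-links
            slices[month][src].add(dst)
    return slices

records = [
    ("2024-12", "news.example", "cited.example"),
    ("2024-12", "news.example", "other.example"),
    ("2025-01", "news.example", "cited.example"),
]
slices = build_temporal_slices(records)
print(sorted(slices))                           # ['2024-12', '2025-01']
print(len(slices["2024-12"]["news.example"]))   # 2
```

Each monthly slice can then be compared against its neighbors to track both content updates and edge churn over time.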

📝 Abstract
Online misinformation poses an escalating threat, amplified by the Internet's open nature and increasingly capable LLMs that generate persuasive yet deceptive content. Existing misinformation detection methods typically focus on either textual content or network structure in isolation, failing to leverage the rich, dynamic interplay between website content and hyperlink relationships that characterizes real-world misinformation ecosystems. We introduce CrediBench: a large-scale data processing pipeline for constructing temporal web graphs that jointly model textual content and hyperlink structure for misinformation detection. Unlike prior work, our approach captures the dynamic evolution of general misinformation domains, including changes in both content and inter-site references over time. Our processed one-month snapshot extracted from the Common Crawl archive in December 2024 contains 45 million nodes and 1 billion edges, representing the largest web graph dataset made publicly available for misinformation research to date. From our experiments on this graph snapshot, we demonstrate the strength of both structural and webpage content signals for learning credibility scores, which measure source reliability. The pipeline, experimentation code, and dataset are publicly available.
Problem

Research questions and friction points this paper is trying to address.

Addresses isolated analysis of text and network structure in misinformation detection
Captures dynamic evolution of content and hyperlinks in misinformation domains
Builds large-scale temporal web graphs integrating content and network signals
Innovation

Methods, ideas, or system contributions that make the work stand out.

Building temporal web graphs for misinformation detection
Jointly modeling textual content and hyperlink structure
Processing web-scale datasets with 45M nodes and 1B edges
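The joint use of content and structural signals can be illustrated with a toy score-propagation scheme: each site's content-based credibility estimate is blended with the mean score of the sites it links to, iterated to a fixed point. All names and the blending rule here are illustrative assumptions; the paper's actual models are graph neural networks over NLP-derived features.

```python
import numpy as np

def propagate_scores(adj, content_scores, alpha=0.5, iters=10):
    """Blend each node's own content score with the mean score of its
    out-neighbors. `adj` is a row-normalized adjacency matrix; `alpha`
    weights self evidence against neighborhood evidence. Illustrative
    sketch only, not the paper's GNN.
    """
    s = content_scores.copy()
    for _ in range(iters):
        s = alpha * content_scores + (1 - alpha) * adj @ s
    return s

# Toy graph: node 2 (low content score) links only to node 0
# (high score), so propagation pulls node 2's credibility upward.
adj = np.array([[0., 1., 0.],
                [1., 0., 0.],
                [1., 0., 0.]])
content = np.array([0.9, 0.8, 0.2])
scores = propagate_scores(adj, content)
print(scores[2] > content[2])  # True
```

The design point this captures is the paper's central claim: a node's credibility is better estimated from content and link structure together than from either alone.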