Web2Wiki: Characterizing Wikipedia Linking Across the Web

📅 2025-05-17
🤖 AI Summary
This study presents the first large-scale analysis of how Wikipedia is referenced across the Web. To address the lack of empirical evidence on how other sites link to Wikipedia, we extract and analyze over 90 million multilingual Wikipedia links from Common Crawl data, spanning 1.68% of Web domains. Our methodology integrates URL parsing, context-aware classification (distinguishing main content, templates, and user-generated content), automated domain annotation (e.g., news, science, business), and multilingual normalization. For English Wikipedia, results show that 95% of links serve an explanatory function, with Wikipedia predominantly cited as background knowledge within main content regions, most often by news and science websites. As a key contribution, we release Web2Wiki, an open, multilingual, web-scale dataset of Wikipedia links that supports empirical research on knowledge ecosystems, information diffusion, and cross-lingual knowledge infrastructure.
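The URL parsing and multilingual normalization steps mentioned in the summary could look roughly like the sketch below. It is only an illustration, not the authors' pipeline: the function name `parse_wikipedia_link`, the hostname pattern, and the decision to accept only `/wiki/` article paths are all assumptions made for this example.

```python
import re
from urllib.parse import urlsplit, unquote

# Assumed hostname shape: <lang>.wikipedia.org or <lang>.m.wikipedia.org (mobile).
_WIKI_HOST = re.compile(r"^(?P<lang>[a-z][a-z0-9-]*)(?:\.m)?\.wikipedia\.org$")

# Subdomains that are not language editions (assumption for this sketch).
_NON_LANGUAGE = {"www", "m"}


def parse_wikipedia_link(url):
    """Return (language_edition, article_title) or None if url is not an article link."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    match = _WIKI_HOST.match(host)
    if match is None or match.group("lang") in _NON_LANGUAGE:
        return None
    # Only /wiki/<Title> paths are handled; /w/index.php?title=... links are skipped.
    prefix = "/wiki/"
    if not parts.path.startswith(prefix):
        return None
    # Normalize the title: percent-decode and turn underscores into spaces.
    title = unquote(parts.path[len(prefix):]).replace("_", " ")
    if not title:
        return None
    return match.group("lang"), title
```

Grouping links by the returned language code is one simple way to separate English Wikipedia links from those pointing at other editions.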

📝 Abstract
Wikipedia is one of the most visited websites globally, yet its role beyond its own platform remains largely unexplored. In this paper, we present the first large-scale analysis of how Wikipedia is referenced across the Web. Using a dataset from Common Crawl, we identify over 90 million Wikipedia links spanning 1.68% of Web domains and examine their distribution, context, and function. Our analysis of English Wikipedia reveals three key findings: (1) Wikipedia is most frequently cited by news and science websites for informational purposes, while commercial websites reference it less often. (2) The majority of Wikipedia links appear within the main content rather than in boilerplate or user-generated sections, highlighting their role in structured knowledge presentation. (3) Most links (95%) serve as explanatory references rather than as evidence or attribution, reinforcing Wikipedia's function as a background knowledge provider. While this study focuses on English Wikipedia, our publicly released Web2Wiki dataset includes links from multiple language editions, supporting future research on Wikipedia's global influence on the Web.
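A figure such as "1.68% of Web domains" is a coverage ratio: the number of distinct domains with at least one Wikipedia link, divided by the number of domains in the crawl. The sketch below shows that arithmetic under assumed inputs. The function name `domain_coverage` and the `(source_domain, wikipedia_url)` record shape are hypothetical and do not describe the Web2Wiki dataset schema.

```python
from collections import Counter


def domain_coverage(link_records, total_domains):
    """Count links per referring domain and the share of domains that link at all.

    link_records: iterable of (source_domain, wikipedia_url) pairs (assumed shape).
    total_domains: number of distinct domains in the crawl being analyzed.
    """
    if total_domains <= 0:
        raise ValueError("total_domains must be positive")
    links_per_domain = Counter(domain for domain, _ in link_records)
    share = len(links_per_domain) / total_domains
    return links_per_domain, share
```

The per-domain counter also makes it easy to see whether links are concentrated in a few heavy referrers or spread across many sites.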
Problem

Research questions and friction points this paper is trying to address.

Analyzing Wikipedia's external web references globally
Examining distribution and purpose of Wikipedia links
Assessing Wikipedia's role as knowledge provider
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale analysis of Wikipedia web references
Dataset built from Common Crawl with over 90 million Wikipedia links
Examines link distribution, context, and function