🤖 AI Summary
This study addresses the tension between releasing fine-grained geographic information (e.g., user country) in Wikipedia pageview statistics and protecting user privacy. We propose the first differentially private publishing framework tailored for large-scale, publicly available web behavioral data. The method combines a Laplace noise injection mechanism with spatiotemporally adaptive privacy budget allocation and optimized aggregation queries to guarantee (ε,δ)-differential privacy for daily country-level pageview statistics. In June 2023, we launched the world’s first differentially private, country-level pageview dataset, externally audited for privacy compliance, with controllable noise and high utility; it is now routinely used by over 1,000 editors and researchers. The core contribution is the first engineering deployment of differential privacy for open web behavioral statistics, balancing fine-grained data release with provable, quantifiable privacy guarantees.
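To make the noise-injection idea concrete, here is a minimal sketch of a textbook Laplace mechanism applied to per-(page, country) counts. It is an illustration only, not the pipeline that was deployed: the function names, the `(page, country)` keys, the example counts, and the assumption that one person changes the released counts by at most `sensitivity` in total are all hypothetical. Note also that the Laplace mechanism on its own yields pure ε-differential privacy (that is, δ = 0).

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale).

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(counts, epsilon, sensitivity=1.0, rng=None):
    """Add Laplace noise with scale sensitivity / epsilon to every count.

    Assumption (hypothetical): adding or removing one person's data changes
    the counts by at most `sensitivity` in total (L1 sensitivity).
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    scale = sensitivity / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}


if __name__ == "__main__":
    # Made-up counts keyed by (page, country), for illustration only.
    true_counts = {("Earth", "FR"): 1200, ("Earth", "BR"): 950}
    print(noisy_counts(true_counts, epsilon=1.0))
```

A smaller `epsilon` gives a larger noise scale and therefore stronger privacy at the cost of accuracy; a real release would also need to bound each person's contribution and typically suppress or post-process small noisy counts.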
📝 Abstract
For almost 20 years, the Wikimedia Foundation has been publishing statistics about how many people visited each Wikipedia page on each day. This data helps Wikipedia editors determine where to focus their efforts to improve the online encyclopedia, and enables academic research. In June 2023, the Wikimedia Foundation, helped by Tumult Labs, addressed a long-standing request from Wikipedia editors and academic researchers: it started publishing these statistics with finer granularity, including the country of origin in the daily counts of page views. This new data publication uses differential privacy to provide robust guarantees to people browsing or editing Wikipedia. This paper describes this data publication: its goals, the process followed from its inception to its deployment, the algorithms used to produce the data, and the outcomes of the data release.