Publishing Wikipedia usage data with strong privacy guarantees

📅 2023-08-30
🏛️ arXiv.org
📈 Citations: 8
Influential: 2
🤖 AI Summary
This study addresses the tension between releasing fine-grained geographic information (e.g., a reader's country) in Wikipedia pageview statistics and protecting user privacy. It proposes the first differentially private publishing framework tailored to large-scale, publicly available web behavioral data: a Laplace noise-injection mechanism, combined with spatiotemporally adaptive privacy-budget allocation and optimized aggregation queries, guarantees (ε,δ)-differential privacy for daily country-level pageview statistics. In June 2023, this work launched the world's first differentially private, country-level pageview dataset, externally audited for privacy compliance, with controllable noise and high utility; it is now routinely used by over 1,000 editors and researchers. The core contribution is the first engineering deployment of differential privacy for open web behavioral statistics, balancing fine-grained data release with provable, quantifiable privacy guarantees.
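The Laplace mechanism at the heart of the summary can be illustrated with a minimal sketch. This is not the Foundation's actual pipeline: the function names, the per-user sensitivity of 1, the epsilon value, and the example counts are all illustrative assumptions.

```python
import random

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. Exponential(rate = 1/scale) draws
    # is distributed as Laplace(0, scale).
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(true_count: int, epsilon: float, sensitivity: float = 1.0) -> float:
    # Adding Laplace(sensitivity / epsilon) noise to a count query with
    # L1 sensitivity `sensitivity` yields epsilon-differential privacy.
    return true_count + laplace_noise(sensitivity / epsilon)

# Hypothetical daily per-country pageview counts for one article.
daily_counts = {"US": 1200, "FR": 310, "NG": 45}
noisy = {country: private_count(c, epsilon=1.0) for country, c in daily_counts.items()}
```

A smaller epsilon means larger noise and stronger privacy. A real deployment would also have to split the overall privacy budget across countries and days, which is what the summary's "spatiotemporally adaptive privacy budget allocation" refers to.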
📝 Abstract
For almost 20 years, the Wikimedia Foundation has been publishing statistics about how many people visited each Wikipedia page on each day. This data helps Wikipedia editors determine where to focus their efforts to improve the online encyclopedia, and enables academic research. In June 2023, the Wikimedia Foundation, helped by Tumult Labs, addressed a long-standing request from Wikipedia editors and academic researchers: it started publishing these statistics with finer granularity, including the country of origin in the daily counts of page views. This new data publication uses differential privacy to provide robust guarantees to people browsing or editing Wikipedia. This paper describes this data publication: its goals, the process followed from its inception to its deployment, the algorithms used to produce the data, and the outcomes of the data release.
Problem

Research questions and friction points this paper is trying to address.

- Publishing Wikipedia usage data with privacy guarantees
- Providing finer granularity in page view statistics
- Using differential privacy to protect user data
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Differential privacy for robust guarantees
- Country-level granularity in the data
- Collaboration with Tumult Labs
Authors

- Temilola Adeleye (Wikimedia Foundation)
- Skye Berghel (Tumult Labs)
- Damien Desfontaines
- Michael Hay (Colgate University)
- Isaac Johnson (Wikimedia Foundation)
- Cléo Lemoisson (Wikimedia Foundation)
- Ashwin Machanavajjhala (Tumult Labs)
- Thomas Magerlein (Tumult Labs)
- G. Modena (Wikimedia Foundation)
- David Pujol (Tumult Labs)
- Daniel Simmons-Marengo (Tumult Labs)
- H. Triedman (Wikimedia Foundation)