CriteoPrivateAd: A Real-World Bidding Dataset to Design Private Advertising Systems

📅 2025-02-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Offline evaluation and modelling of privacy-enhancing advertising systems (e.g., Chrome's Privacy Sandbox) is hindered by the lack of a public, sufficiently rich bidding dataset. This paper introduces CriteoPrivateAd, an anonymised version of Criteo production logs and, in terms of number of features, the largest bidding dataset released to date, built in alignment with major browser vendors' proposals. It provides enough data to learn bidding models under several privacy constraints: delayed reports, display- and user-level differential privacy, user signal quantisation, and aggregated reports. The authors report that, despite anonymisation, offline results obtained on the dataset are close to the production performance of adtech companies including Criteo. The dataset is available on Hugging Face.

📝 Abstract
In recent years, many proposals have emerged to address online advertising use-cases without access to third-party cookies. All of these proposals leverage privacy-enhancing technologies such as aggregation or differential privacy. Yet no public and sufficiently rich ground truth is currently available to assess the relevance of these private advertising frameworks. We are releasing the largest bidding dataset, in terms of number of features, specifically built in alignment with the design of major browser vendors' proposals such as Chrome Privacy Sandbox. This dataset, coined CriteoPrivateAd, is an anonymised version of Criteo production logs and provides sufficient data to learn bidding models commonly used in online advertising under many privacy constraints (delayed reports, display- and user-level differential privacy, user signal quantisation, or aggregated reports). We ensured that this dataset, while being anonymised, is able to provide offline results close to the production performance of adtech companies including Criteo, making it a relevant ground truth to design private advertising systems. The dataset is available on Hugging Face: https://huggingface.co/datasets/criteo/CriteoPrivateAd.
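As a rough illustration of two of the constraints named in the abstract (user signal quantisation, and aggregated reports with display-level differential privacy), the Python sketch below buckets a continuous signal and adds Laplace noise to per-campaign click counts. The field names (`campaign`, `click`), the bucket edges, the campaign list, and the value of epsilon are all hypothetical and are not taken from the dataset schema. The Laplace mechanism is used here only as a standard textbook choice, not as the mechanism of any specific browser proposal.

```python
import random

def quantise(value, bucket_edges):
    """Map a continuous user signal to a coarse bucket index.

    bucket_edges is an ascending list of thresholds (hypothetical values).
    """
    return sum(value >= edge for edge in bucket_edges)

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)

def noisy_click_counts(rows, campaigns, epsilon, rng):
    """Aggregated report: noisy click count per campaign.

    Assumption: each display appears in exactly one row and adds at most 1
    to a single count, so the L1 sensitivity is 1 and Laplace noise of scale
    1/epsilon gives epsilon-DP at the display level. The campaign list is
    fixed in advance so that empty buckets are also reported with noise.
    """
    counts = {c: 0 for c in campaigns}
    for row in rows:
        if row["campaign"] in counts:
            counts[row["campaign"]] += row["click"]
    return {c: n + laplace_noise(1.0 / epsilon, rng) for c, n in counts.items()}

# Toy displays; these rows are invented for illustration only.
rows = [
    {"campaign": "a", "click": 1},
    {"campaign": "a", "click": 1},
    {"campaign": "a", "click": 0},
    {"campaign": "b", "click": 0},
]
report = noisy_click_counts(rows, ["a", "b", "c"], epsilon=1.0, rng=random.Random(0))
bucket = quantise(0.3, [0.25, 0.5, 0.75])
```

A bidding model trained under such constraints would see only `bucket` instead of the raw signal, and only `report` instead of per-display click labels. User-level differential privacy would additionally require bounding how many displays a single user can contribute, which this sketch does not do.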
Problem

Research questions and friction points this paper is trying to address.

Lack of public dataset for private advertising systems
Dataset aligns with browser vendors' privacy proposals
Anonymised dataset supports offline ad performance testing
Innovation

Methods, ideas, or system contributions that make the work stand out.

Largest bidding dataset released, in terms of number of features
Aligns with Chrome Privacy Sandbox
Ensures anonymized yet production-like results
Mehdi Sebbar
Criteo AI Lab, Paris, France
Corentin Odic
Criteo AI Lab, Paris, France
Mathieu Léchine
Criteo AI Lab, Paris, France
Alois Bissuel
Criteo, Paris, France
Nicolas Chrysanthos
Criteo AI Lab, Paris, France
Anthony D'Amato
Criteo AI Lab, Paris, France
Alexandre Gilotte
Criteo AI Lab
Machine Learning
Fabian Höring
Criteo, Paris, France
Sarah Nogueira
Criteo AI Lab, Paris, France
Maxime Vono
Staff Research Lead, Criteo AI Lab
Machine Learning, Computational Statistics, Signal Processing