🤖 AI Summary
The lack of publicly available, realistic, high-fidelity bidding data hinders offline evaluation and modeling of privacy-enhancing advertising systems (e.g., Chrome’s Privacy Sandbox). To address this, we introduce CriteoPrivateAd, a large-scale, real-world bidding dataset that is the largest of its kind by number of features and is explicitly designed for such systems. It covers key constraints including delayed reporting, user- and impression-level differential privacy, signal quantization, and aggregated reporting. We propose a production-grade anonymization framework that jointly integrates differential privacy, log sanitization, and fidelity calibration, ensuring strong privacy guarantees while preserving evaluation validity. Despite anonymization, models trained on the dataset achieve offline results close to the production performance of adtech companies, including Criteo. The dataset is publicly available on Hugging Face, filling a gap in high-quality benchmarks for privacy-preserving advertising research.
📝 Abstract
In recent years, many proposals have emerged to address online advertising use cases without access to third-party cookies. All of these proposals rely on privacy-enhancing technologies such as aggregation or differential privacy. Yet no public, sufficiently rich ground truth is currently available to assess the relevance of these private advertising frameworks. We are releasing the largest bidding dataset, in terms of number of features, built specifically in alignment with the design of major browser vendors' proposals such as Chrome Privacy Sandbox. This dataset, named CriteoPrivateAd, is an anonymised version of Criteo production logs and provides sufficient data to learn the bidding models commonly used in online advertising under many privacy constraints (delayed reports, display- and user-level differential privacy, user signal quantisation, and aggregated reports). We ensured that this dataset, while anonymised, yields offline results close to the production performance of adtech companies, including Criteo, making it a relevant ground truth for designing private advertising systems. The dataset is available on Hugging Face: https://huggingface.co/datasets/criteo/CriteoPrivateAd.
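To make two of the constraints listed above more concrete, here is a minimal sketch of user signal quantisation and of an aggregated report with display-level differential privacy. It is an illustration under stated assumptions, not the dataset's schema or Criteo's actual mechanism: the function names, the `(key, clicked)` row layout, the bucket range, and the choice of the Laplace mechanism are all hypothetical.

```python
import random


def quantise(value, n_buckets, lo=0.0, hi=1.0):
    """Map a continuous user signal to one of n_buckets integer ids.

    Assumption: the signal is clipped to [lo, hi] and bucketed uniformly.
    """
    clipped = min(max(value, lo), hi)
    bucket = int((clipped - lo) / (hi - lo) * n_buckets)
    return min(bucket, n_buckets - 1)  # hi itself falls in the last bucket


def laplace_noise(scale, rng):
    """Draw a Laplace(0, scale) sample as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def aggregated_report(rows, keys, epsilon, rng):
    """Return noisy click counts per aggregation key.

    Assumptions: each row is one display, given as (key, clicked), so adding
    or removing a display changes a single count by at most 1 (sensitivity 1).
    The key set is fixed in advance so it does not depend on the data.
    """
    counts = {key: 0 for key in keys}
    for key, clicked in rows:
        counts[key] += int(clicked)
    scale = 1.0 / epsilon  # Laplace mechanism with sensitivity 1
    return {key: count + laplace_noise(scale, rng) for key, count in counts.items()}


# Hypothetical usage: three displays over two campaign keys.
rng = random.Random(0)
rows = [("campaign_a", True), ("campaign_a", False), ("campaign_b", True)]
report = aggregated_report(rows, ["campaign_a", "campaign_b"], epsilon=1.0, rng=rng)
```

A smaller `epsilon` gives stronger privacy but noisier counts, which is the kind of trade-off a bidding model trained on such reports has to absorb.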