🤖 AI Summary
This study investigates multidimensional privacy leakage risks arising from e-commerce consumption data. Using shopping records from 4,248 U.S. Amazon users matched with demographic and health survey data, we build predictive models that infer sensitive attributes, including race, gender, age, and diabetes status, with logistic regression and random forest classifiers. Models are evaluated with AUC, stratified sampling, and feature attribution analysis. To our knowledge, this is the first open, quantitative assessment of cross-dimensional privacy inference risk in e-commerce settings. We further propose a scalable analytical framework linking data scale to model inference capability. Results show high inference accuracy: gender prediction achieves AUC > 0.9, and diabetes status prediction attains AUC > 0.8. Category-level analysis identifies high-risk product categories, such as specific health foods and exercise equipment, that disproportionately contribute to leakage. These findings provide empirical grounding and actionable guidance for targeted privacy-preserving interventions.
📝 Abstract
What do pickles and trampolines have in common? In this paper we show that while purchases of these products may seem innocuous, they risk revealing clues about customers' personal attributes: in this case, their race. As online retail and digital purchases become increasingly common, consumer data has become increasingly valuable, raising the risks of privacy violations and online discrimination. This work provides the first open analysis measuring these risks, using purchase histories crowdsourced from US Amazon.com customers (N=4,248) and survey data on their personal attributes. With this limited sample and simple models, we demonstrate how easily consumers' personal attributes, such as health and lifestyle information, gender, age, and race, can be inferred from purchases. For example, our models achieve AUC values over 0.9 for predicting gender and over 0.8 for predicting diabetes status. To better understand the risks that highly resourced firms like Amazon, data brokers, and advertisers present to consumers, we measure how our models' predictive power scales with more data. Finally, we measure and highlight how different product categories contribute to inference risk in order to make our findings more interpretable and actionable for future researchers and privacy advocates.
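For readers unfamiliar with the AUC figures quoted above, the following is a minimal sketch of what the metric measures: the probability that a randomly chosen positive example (say, a customer with the attribute) receives a higher model score than a randomly chosen negative one. The function name `auc` and the toy labels and scores are hypothetical and purely illustrative; they are not the paper's code or data.

```python
from itertools import product


def auc(labels, scores):
    """Pairwise AUC: share of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = has the attribute, 0 = does not.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = auc(labels, scores)  # 8 of the 9 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect ranking, so values above 0.9 indicate that the model orders almost all positive/negative pairs correctly.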