🤖 AI Summary
This study systematically evaluates implicit gender bias in large language models (LLMs) through the lens of e-commerce consumption behavior: specifically, their ability to infer user gender solely from U.S. online shopping histories, and the bias mechanisms underlying those inferences.
Method: We employ multi-model gender classification, debiasing prompt engineering, product-gender co-occurrence statistics, manually annotated chain-of-reasoning traces, and attribution analysis across six mainstream LLMs.
Contribution/Results: All models achieve moderate prediction accuracy (~65–78%) but rely heavily on stereotypical category associations (e.g., nail polish → female; razors → male). While debiasing prompts significantly reduce confidence scores, they fail to eliminate structural bias patterns. Crucially, this work uncovers a causal pathway linking consumption-behavior representations to implicit LLM bias, establishing a reproducible, cross-modal bias analysis framework for algorithmic fairness evaluation.
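The product–gender co-occurrence statistics mentioned above can be sketched as a pointwise mutual information (PMI) computation over purchase records. The records, category names, and gender labels below are illustrative placeholders, not the paper's dataset or exact methodology.

```python
from collections import Counter
import math

# Hypothetical purchase records: (product_category, self_reported_gender).
# Labels are illustrative only.
purchases = [
    ("nail polish", "F"), ("nail polish", "F"), ("razor", "M"),
    ("razor", "M"), ("razor", "F"), ("headphones", "M"),
    ("headphones", "F"), ("nail polish", "M"),
]

def gender_pmi(records, category, gender):
    """PMI between a category and a gender:
    log p(category, gender) / (p(category) * p(gender)).
    Positive values mean the pairing co-occurs more often than chance,
    which is one way to quantify stereotypical associations."""
    n = len(records)
    joint = Counter(records)
    cat_counts = Counter(c for c, _ in records)
    gen_counts = Counter(g for _, g in records)
    p_joint = joint[(category, gender)] / n
    p_cat = cat_counts[category] / n
    p_gen = gen_counts[gender] / n
    return math.log(p_joint / (p_cat * p_gen))

print(round(gender_pmi(purchases, "nail polish", "F"), 3))  # ≈ 0.288
```

A positive score for ("nail polish", "F") in this toy data mirrors the kind of stereotypical category–gender association the study reports; a full analysis would compute such scores over the real purchase histories and compare them against the models' predictions.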
📝 Abstract
With the wide, cross-domain adoption of Large Language Models (LLMs), it becomes crucial to assess to what extent the statistical correlations in training data, which underlie their impressive performance, hide subtle and potentially troubling biases. Gender bias in LLMs has been widely investigated from the perspectives of occupations, hobbies, and emotions typically associated with a specific gender. In this study, we introduce a novel perspective: we investigate whether LLMs can predict an individual's gender based solely on online shopping histories, and whether these predictions are influenced by gender biases and stereotypes. Using a dataset of historical online purchases from users in the United States, we evaluate the ability of six LLMs to classify gender and then analyze their reasoning and product–gender co-occurrences. Results indicate that while models can infer gender with moderate accuracy, their decisions are often rooted in stereotypical associations between product categories and gender. Furthermore, explicit instructions to avoid bias reduce the certainty of model predictions but do not eliminate stereotypical patterns. Our findings highlight the persistence of gender bias in LLMs and emphasize the need for robust bias-mitigation strategies.