🤖 AI Summary
Popularity bias in recommender systems leaves long-tail items underexposed, undermining both fairness and overall recommendation quality. To address this, we propose a post-hoc method based on sparse autoencoders (SAEs), the first to use SAEs for interpretable, neuron-level identification and mitigation of popularity bias in deep recommendation models. Synthetic user activation patterns are used to localize biased neurons, enabling fine-grained intervention at inference time. The approach requires no modification to the original model architecture or training objective, keeping it transparent and easy to deploy. Experiments on two public benchmark datasets show that the method significantly improves recommendation fairness (e.g., reducing ILAD and the Gini coefficient by 12.6%–28.3%) while degrading accuracy by less than 0.5%, yielding a controllable and interpretable fairness-accuracy trade-off.
📝 Abstract
Popularity bias is a well-known challenge in recommender systems, where a small number of popular items receive disproportionate attention, while the majority of less popular items are largely overlooked. This imbalance often results in reduced recommendation quality and unfair exposure of items. Although existing mitigation techniques address this bias to some extent, they typically lack transparency in how they operate. In this paper, we propose a post-hoc method using a Sparse Autoencoder (SAE) to interpret and mitigate popularity bias in deep recommendation models. The SAE is trained to replicate a pre-trained model's behavior while enabling neuron-level interpretability. By introducing synthetic users with clear preferences for either popular or unpopular items, we identify neurons encoding popularity signals based on their activation patterns. We then adjust the activations of the most biased neurons to steer recommendations toward fairer exposure. Experiments on two public datasets using a sequential recommendation model show that our method significantly improves fairness with minimal impact on accuracy. Moreover, it offers interpretability and fine-grained control over the fairness-accuracy trade-off.
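The two core steps described above (locating popularity-encoding neurons from synthetic-user activation gaps, then dampening them at inference) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the latent dimensionality, cohort sizes, the mean-activation-gap ranking, and the scaling factor `alpha` are all assumptions, and the "popularity" neurons are planted synthetically so the demo is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical SAE latent codes for two synthetic-user cohorts.
# Shapes and the planted bias neurons (indices 2 and 7) are illustrative only.
n_latent = 16
pop_codes = rng.normal(0.0, 1.0, size=(50, n_latent))    # users preferring popular items
unpop_codes = rng.normal(0.0, 1.0, size=(50, n_latent))  # users preferring long-tail items
pop_codes[:, [2, 7]] += 3.0  # plant a popularity signal for the demo

def find_biased_neurons(pop, unpop, k):
    """Rank SAE latent neurons by the mean activation gap between cohorts."""
    gap = pop.mean(axis=0) - unpop.mean(axis=0)
    return np.argsort(-np.abs(gap))[:k]

def steer(code, neurons, alpha=0.0):
    """Dampen the selected neurons before decoding (alpha=0 silences them)."""
    out = code.copy()
    out[..., neurons] *= alpha
    return out

biased = find_biased_neurons(pop_codes, unpop_codes, k=2)
user_code = pop_codes[0]
steered = steer(user_code, biased, alpha=0.2)  # alpha controls the trade-off
```

Here `alpha` plays the role of the fine-grained fairness-accuracy control knob: values near 1 leave recommendations unchanged, while values near 0 fully suppress the identified popularity signal before the SAE decoder reconstructs the representation passed back to the recommender.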