🤖 AI Summary
This work investigates whether fairness in model representations reliably reflects fairness in recommendation outcomes, i.e., recommendation parity. Through extensive multi-model comparative experiments on one real-world and several synthetic datasets, the study combines demographic attribute classification tasks, recommendation list analysis, and diverse fairness metrics, and proposes two novel measures of demographic predictability grounded in ranked recommendations. The findings show that while optimizing representation-level fairness can improve recommendation fairness, evaluation at the representation level alone is an unreliable proxy for outcome-level fairness. This challenges a prevailing evaluation practice in fair recommendation research and offers a new methodological perspective for assessing and designing equitable recommender systems.
📝 Abstract
One of the many fairness definitions pursued in recent recommender system research targets mitigating demographic information encoded in model representations. Models optimized for this definition are typically evaluated on how well demographic attributes can be classified given model representations, with the (implicit) assumption that this measure accurately reflects *recommendation parity*, i.e., how similar recommendations given to different users are. We challenge this assumption by comparing the amount of demographic information encoded in representations with various measures of how the recommendations differ. We propose two new approaches for measuring how well demographic information can be classified given ranked recommendations. Our results from extensive testing of multiple models on one real and multiple synthetically generated datasets indicate that optimizing for fair representations positively affects recommendation parity, but also that evaluation at the representation level is not a good proxy for measuring this effect when comparing models. We also provide extensive insight into how recommendation-level fairness metrics behave for various models by evaluating their performance on numerous generated datasets with different properties.
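The two evaluation levels the abstract contrasts can be sketched in a few lines. The snippet below is an illustrative toy, not the paper's implementation: it uses synthetic data, a dot-product recommender, and a simple mean-difference linear probe (all our own simplifications) to show how demographic predictability can be measured both from user embeddings and from binarized top-k recommendation lists.

```python
# Toy sketch: probe how well a binary demographic attribute can be predicted
# (a) from user embeddings and (b) from top-k recommendation lists.
# All data is synthetic; the probe and setup are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_users, dim, n_items, k = 2000, 16, 200, 10

attr = rng.integers(0, 2, size=n_users)        # synthetic demographic attribute
user_emb = rng.normal(size=(n_users, dim))
user_emb[:, 0] += 1.5 * attr                   # embeddings "leak" the attribute
item_emb = rng.normal(size=(n_items, dim))

# Dot-product recommender: top-k items per user, as a binary indicator matrix.
scores = user_emb @ item_emb.T
topk = np.argsort(-scores, axis=1)[:, :k]
rec_matrix = np.zeros((n_users, n_items))
np.put_along_axis(rec_matrix, topk, 1.0, axis=1)

def probe_auc(features, attr):
    """Linear probe along the class-mean difference; AUC via Mann-Whitney U."""
    w = features[attr == 1].mean(axis=0) - features[attr == 0].mean(axis=0)
    s = features @ w
    pos, neg = s[attr == 1], s[attr == 0]
    return (pos[:, None] > neg[None, :]).mean()

print(f"attribute AUC from embeddings: {probe_auc(user_emb, attr):.2f}")
print(f"attribute AUC from top-{k} lists: {probe_auc(rec_matrix, attr):.2f}")
```

In this toy setup the leakage in the embeddings carries through to the recommendation lists, so both probes score well above the 0.5 chance level; the paper's point is that across real models the two levels need not agree, so the representation-level probe alone cannot be trusted as a proxy.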