AI Summary
This study investigates whether natural language (NL) music taste profiles automatically generated by large language models (LLMs) are perceived as accurate by users, and whether systematic biases exist with respect to user attributes (e.g., mainstreamness, taste diversity) and item characteristics (e.g., genre, country of origin). Leveraging listening histories to generate NL profiles, we conduct a large-scale user survey and evaluate downstream recommendation performance to establish, for the first time, a joint analysis of profile endorsement and recommendation fairness. Results reveal that users with higher mainstreamness and lower taste diversity exhibit significantly greater endorsement; conversely, non-Western and niche-genre items substantially reduce endorsement, and this bias persists in recommendation accuracy and coverage. Our work uncovers implicit cultural and cognitive biases embedded in LLM-driven explainable recommender systems, providing novel empirical evidence and a methodological framework for designing fairer, more trustworthy personalized recommender systems.
Abstract
One particularly promising use case of Large Language Models (LLMs) for recommendation is the automatic generation of Natural Language (NL) user taste profiles from consumption data. These profiles offer interpretable and editable alternatives to opaque collaborative filtering representations, enabling greater transparency and user control. However, it remains unclear whether users consider these profiles to be accurate representations of their taste, which is crucial for trust and usability. Moreover, because LLMs inherit societal and data-driven biases, profile quality may vary systematically across user and item characteristics. In this paper, we study this issue in the context of music streaming, where personalization is challenged by a large and culturally diverse catalog. We conduct a user study in which participants rate NL profiles generated from their own listening histories. We analyze whether identification with the profiles is biased by user attributes (e.g., mainstreamness, taste diversity) and item features (e.g., genre, country of origin). We also compare these patterns to those observed when the profiles are used in a downstream recommendation task. Our findings highlight both the potential and the limitations of scrutable, LLM-based profiling in personalized systems.