AI Summary
This study addresses the limitations of personalizing large language models (LLMs) using single demographic cues, which can introduce or amplify group biases through oversimplified assumptions. Although such approaches are common, their real-world validity and robustness have not been systematically evaluated. To bridge this gap, the work compares six widely used demographic cues across seven mainstream open- and closed-source LLMs. Through controlled experiments on four writing and advice tasks, the authors quantitatively assess how these cues influence personalized outputs. The findings reveal that, despite high inter-cue correlations, substantial output variation emerges across personas, underscoring the inadequacy of relying on any single cue. The authors therefore caution against conclusions drawn from a single persona cue and recommend that future personalization research evaluate multiple externally valid cues.
Abstract
Personalization of LLMs by sociodemographic subgroup often improves user experience, but can also introduce or amplify biases and unfair outcomes across groups. Prior work has employed so-called personas, sociodemographic user attributes conveyed to a model, to study bias in LLMs by relying on a single cue to prompt a persona, such as user names or explicit attribute mentions. This disregards LLM sensitivity to prompt variations (robustness) and the rarity of some cues in real interactions (external validity). We compare six commonly used persona cues across seven open and proprietary LLMs on four writing and advice tasks. While cues are overall highly correlated, they produce substantial variance in responses across personas. We therefore caution against claims from a single persona cue and recommend future personalization research to evaluate multiple externally valid cues.