AI Summary
This work addresses the limitations of traditional sociodemographic prompting (SDP) for evaluating the cultural alignment of large language models. SDP is susceptible to confounding factors such as prompt sensitivity, decoding parameters, and task complexity, which makes it difficult to disentangle model bias from flaws in task design. To overcome these issues, the authors propose Inverse Sociodemographic Prompting (ISDP), which reframes cultural alignment assessment from a generation task into a discrimination task by prompting models to identify user groups from real or simulated user behaviors. Experiments on the Goodreads-CSI dataset with Aya-23, Gemma-2, GPT-4o, and LLaMA-3.1 show that models generally perform better on real user behaviors than on simulated ones. At the individual level, however, performance on both behavior types drops and becomes nearly equal, which points to a bottleneck in current large language models' capacity for personalized cultural understanding.
Abstract
Socio-demographic prompting (SDP) - prompting Large Language Models (LLMs) using demographic proxies to generate culturally aligned outputs - often shows LLM responses to be stereotypical and biased. While effective in assessing LLMs' cultural competency, SDP is prone to confounding factors such as prompt sensitivity, decoding parameters, and the inherent difficulty of generation over discrimination tasks due to larger output spaces. These factors complicate interpretation, making it difficult to determine whether poor performance is due to bias or to the task design. To address this, we use inverse socio-demographic prompting (ISDP), where we prompt LLMs to discriminate and predict the demographic proxy from the actual and simulated behavior of different users. We use the Goodreads-CSI dataset (Saha et al., 2025), which captures difficulty in understanding English book reviews for users from India, Mexico, and the USA, and test four LLMs with ISDP: Aya-23, Gemma-2, GPT-4o, and LLaMA-3.1. Results show that models perform better with actual behaviors than with simulated ones, contrary to what SDP suggests. However, performance with both behavior types diminishes and becomes nearly equal at the individual level, indicating limits to personalization.
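To make the discrimination framing concrete, the following is a minimal sketch of how an ISDP-style evaluation could be scored. It is not the authors' implementation. The prompt wording, the function names (`build_isdp_prompt`, `parse_country`, `isdp_accuracy`), and the representation of user behavior as a list of review spans marked as difficult are all assumptions made for illustration; `model` stands in for any callable that maps a prompt string to a response string.

```python
import re
from typing import Callable, Optional, Sequence, Tuple

# The three countries named in the abstract; used as the label set.
COUNTRIES = ("India", "Mexico", "USA")

# One example: (review text, spans the user marked as difficult, gold country).
# This record layout is hypothetical, not the Goodreads-CSI schema.
Example = Tuple[str, Sequence[str], str]


def build_isdp_prompt(review: str, difficult_spans: Sequence[str]) -> str:
    """Build a discrimination prompt (hypothetical wording)."""
    spans = "; ".join(difficult_spans) if difficult_spans else "(none)"
    options = ", ".join(COUNTRIES)
    return (
        "A reader marked parts of an English book review as hard to understand.\n"
        f"Review: {review}\n"
        f"Marked spans: {spans}\n"
        f"Which country is this reader most likely from? Answer with one of: {options}."
    )


def parse_country(response: str) -> Optional[str]:
    """Return the country named in the response, or None if zero or several appear."""
    hits = [
        country
        for country in COUNTRIES
        if re.search(rf"\b{re.escape(country)}\b", response, flags=re.IGNORECASE)
    ]
    return hits[0] if len(hits) == 1 else None


def isdp_accuracy(examples: Sequence[Example], model: Callable[[str], str]) -> float:
    """Fraction of examples where the parsed prediction matches the gold country."""
    if not examples:
        return 0.0
    correct = 0
    for review, spans, gold in examples:
        prediction = parse_country(model(build_isdp_prompt(review, spans)))
        correct += int(prediction == gold)
    return correct / len(examples)


if __name__ == "__main__":
    # Stub model for demonstration only: always answers "India".
    def always_india(prompt: str) -> str:
        return "The reader is most likely from India."

    toy_examples = [
        ("A cozy whodunit set at Thanksgiving.", ["Thanksgiving"], "India"),
        ("A road trip across the Midwest.", ["Midwest"], "Mexico"),
    ]
    print(isdp_accuracy(toy_examples, always_india))
```

Because the output space is a fixed label set, scoring reduces to exact-match accuracy, which is the property that makes the discrimination framing easier to interpret than free-form generation. Running the same scorer on actual and on simulated behaviors would allow the kind of comparison the abstract describes.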