To Generate or Discriminate? Methodological Considerations for Measuring Cultural Alignment in LLMs

📅 2026-01-06

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the limitations of socio-demographic prompting (SDP) for evaluating the cultural alignment of large language models. SDP is susceptible to confounding factors such as prompt sensitivity, decoding parameters, and the greater difficulty of generation compared with discrimination, which make it hard to tell whether poor performance reflects model bias or flaws in task design. To address this, the authors use inverse socio-demographic prompting (ISDP), which reframes cultural alignment assessment from a generation task into a discrimination task: the model is prompted to predict a user's demographic proxy from actual or simulated user behavior. Experiments on the Goodreads-CSI dataset with Aya-23, Gemma-2, GPT-4o, and LLaMA-3.1 show that models perform better with actual behaviors than with simulated ones. At the individual level, however, performance with both behavior types diminishes and becomes nearly equal, indicating limits to personalization.

📝 Abstract

Socio-demographic prompting (SDP), that is, prompting Large Language Models (LLMs) using demographic proxies to generate culturally aligned outputs, often shows LLM responses as stereotypical and biased. While effective in assessing LLMs' cultural competency, SDP is prone to confounding factors such as prompt sensitivity, decoding parameters, and the inherent difficulty of generation over discrimination tasks due to larger output spaces. These factors complicate interpretation, making it difficult to determine whether poor performance is due to bias or to the task design. To address this, we use inverse socio-demographic prompting (ISDP), where we prompt LLMs to discriminate and predict the demographic proxy from the actual and simulated behavior of different users. We use the Goodreads-CSI dataset (Saha et al., 2025), which captures difficulty in understanding English book reviews for users from India, Mexico, and the USA, and test four LLMs: Aya-23, Gemma-2, GPT-4o, and LLaMA-3.1 with ISDP. Results show that models perform better with actual behaviors than simulated ones, contrary to what SDP suggests. However, performance with both behavior types diminishes and becomes nearly equal at the individual level, indicating limits to personalization.
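
The ISDP setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the prompt wording, the function names (`build_isdp_prompt`, `parse_country`, `isdp_accuracy`), the `(review, behavior, gold_country)` tuple layout, and the `model` callable are all assumptions made for illustration. Only the three countries and the idea of predicting the demographic proxy from a user behavior come from the abstract.

```python
import re
from typing import Callable, Iterable, Optional, Tuple

# The three user groups named in the abstract.
COUNTRIES = ("India", "Mexico", "USA")


def build_isdp_prompt(review: str, behavior: str) -> str:
    """Build a discrimination-style prompt (hypothetical wording).

    Instead of asking the model to generate behavior for a given country
    (SDP), we show it a behavior and ask which country it came from (ISDP).
    """
    options = ", ".join(COUNTRIES)
    return (
        f"Book review:\n{review}\n\n"
        f"A reader reported this difficulty with the review:\n{behavior}\n\n"
        f"Which country is the reader most likely from? "
        f"Answer with one of: {options}."
    )


def parse_country(response: str) -> Optional[str]:
    """Return the country named in the response, or None if zero or several match."""
    hits = [
        c for c in COUNTRIES
        if re.search(rf"\b{re.escape(c)}\b", response, re.IGNORECASE)
    ]
    return hits[0] if len(hits) == 1 else None


def isdp_accuracy(
    examples: Iterable[Tuple[str, str, str]],
    model: Callable[[str], str],
) -> float:
    """Fraction of examples where the predicted country equals the gold label.

    `examples` yields (review, behavior, gold_country) tuples; `model` is any
    callable mapping a prompt string to a response string (an assumption, so
    the sketch stays independent of a particular LLM API).
    """
    correct = 0
    total = 0
    for review, behavior, gold in examples:
        prediction = parse_country(model(build_isdp_prompt(review, behavior)))
        correct += int(prediction == gold)
        total += 1
    return correct / total if total else 0.0
```

Because the output space is a fixed set of three labels, scoring reduces to exact-match accuracy, which is one reason a discrimination framing is easier to interpret than free-form generation. Running `isdp_accuracy` separately on actual and on simulated behaviors would give the kind of comparison the abstract reports.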

Problem

Research questions and friction points this paper is trying to address.

cultural alignment

socio-demographic prompting

large language models

bias

generation vs. discrimination

Innovation

Methods, ideas, or system contributions that make the work stand out.

Inverse Socio-demographic Prompting

Cultural Alignment

Discrimination Task