🤖 AI Summary
This study addresses a key limitation of current approaches to probing demographic attributes in large language models, which often treat a single cue, such as a name or dialect, as an adequate proxy for a social group without establishing its construct validity. Through controlled experiments in realistic advice-seeking interactions, the authors systematically evaluate how different racial and gender cues influence model behavior, using multi-cue comparisons, behavioral disparity analyses, and controls for linguistic confounds. They find that distinct cues intended to signal the same demographic group elicit markedly inconsistent model responses, while differences between groups are small and unstable. These results challenge the construct validity of prevailing probing paradigms, suggesting they cannot reliably capture demographically conditioned behavior in language models.
📝 Abstract
Demographic probing is widely used to study how large language models (LLMs) adapt their behavior to signaled demographic attributes. This approach typically uses a single demographic cue in isolation (e.g., a name or dialect) as a signal for group membership, implicitly assuming strong construct validity: that such cues are interchangeable operationalizations of the same underlying, demographically conditioned behavior. We test this assumption in realistic advice-seeking interactions, focusing on race and gender in a U.S. context. We find that cues intended to represent the same demographic group induce only partially overlapping changes in model behavior, while differentiation between groups within a given cue is weak and uneven. Consequently, estimated disparities are unstable, with both magnitude and direction varying across cues. We further show that these inconsistencies partly arise from variation in how strongly cues encode demographic attributes and from linguistic confounders that independently shape model behavior. Together, our findings suggest that demographic probing lacks construct validity: it does not yield a single, stable characterization of how LLMs condition on demographic information, which may reflect a misspecified or fragmented construct. We conclude by recommending the use of multiple, ecologically valid cues and explicit control of confounders to support more defensible claims about demographic effects in LLMs.
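The cross-cue consistency check described above can be sketched in a few lines. This is a minimal illustration with invented numbers: the group labels, cue types, and behavioral scores below are hypothetical placeholders, not the paper's data or metrics. The idea is that if cues for the same group were interchangeable operationalizations of one construct, per-cue disparity estimates would agree in sign and rough magnitude; when they do not, the construct-validity assumption fails.

```python
# Hedged sketch of a cross-cue disparity consistency check.
# All scores, cue names, and groups are invented for illustration only;
# in the real setup, scores would come from rating LLM responses to
# advice-seeking prompts carrying each demographic cue.
from statistics import mean

# Hypothetical behavioral scores (e.g., advice directness on a 0-1 scale),
# keyed by (demographic group, cue type).
scores = {
    ("group_A", "name"):    [0.62, 0.58, 0.65],
    ("group_A", "dialect"): [0.41, 0.44, 0.39],
    ("group_B", "name"):    [0.55, 0.57, 0.53],
    ("group_B", "dialect"): [0.48, 0.52, 0.50],
}

def disparity(cue):
    """Between-group gap (group_A minus group_B) induced by a single cue."""
    return mean(scores[("group_A", cue)]) - mean(scores[("group_B", cue)])

cues = ["name", "dialect"]
gaps = {cue: disparity(cue) for cue in cues}

# If cues were interchangeable signals of the same attribute, the
# estimated disparities would at least agree in sign across cues.
signs_agree = len({gap > 0 for gap in gaps.values()}) == 1

print(gaps, "sign-consistent:", signs_agree)
```

In this toy example the name cue yields a positive gap while the dialect cue yields a negative one, so `signs_agree` is `False`: exactly the kind of direction-flipping instability the abstract reports across real cues.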