🤖 AI Summary
This study investigates how large language models internally represent racial and ethnic information and traces the mechanisms by which those representations give rise to both explicit and implicit biases in high-stakes domains such as healthcare. Using an end-to-end interpretability pipeline that combines probing, neuron-level attribution (e.g., Integrated Gradients), and targeted neuron ablation, the work offers the first systematic characterization of how racial/ethnic representations are distributed within these models. The findings show that identical demographic cues can produce qualitatively divergent model behaviors, and that current debiasing approaches alter surface-level outputs without changing the underlying representations. Key neurons encoding stereotypical associations are identified; ablating them reduces bias, but substantial residual effects persist, underscoring the need for more comprehensive representation-level mitigation strategies.
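The neuron-level attribution step mentioned above relies on Integrated Gradients: attributions are the input-baseline difference weighted by the average gradient along a straight-line path between the two. A minimal NumPy sketch of that approximation on a made-up linear scoring function (the function `f`, its weights, and all names here are illustrative assumptions, not the paper's models or code):

```python
import numpy as np

def integrated_gradients(f, grad_f, x, baseline, steps=50):
    """Approximate Integrated Gradients attributions for a scalar function f."""
    alphas = (np.arange(steps) + 0.5) / steps            # midpoint rule over [0, 1]
    path = baseline + alphas[:, None] * (x - baseline)   # points on the straight-line path
    avg_grad = grad_f(path).mean(axis=0)                 # average gradient along the path
    return (x - baseline) * avg_grad                     # per-feature attributions

# Toy "model": a linear score over a 3-d feature vector (hypothetical weights).
w = np.array([2.0, -1.0, 0.5])
f = lambda x: x @ w
grad_f = lambda X: np.tile(w, (X.shape[0], 1))          # gradient is constant for a linear f

x = np.array([1.0, 2.0, 3.0])
baseline = np.zeros(3)
attr = integrated_gradients(f, grad_f, x, baseline)
# Completeness property: attributions sum to f(x) - f(baseline).
assert np.isclose(attr.sum(), f(x) - f(baseline))
```

For a linear model the path integral is exact, so each attribution reduces to `w_i * (x_i - baseline_i)`; for a real LLM the gradients vary along the path and `steps` controls the quality of the approximation.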
📝 Abstract
Large language models (LLMs) increasingly operate in high-stakes settings including healthcare and medicine, where demographic attributes such as race and ethnicity may be explicitly stated or implicitly inferred from text. However, existing studies primarily document outcome-level disparities, offering limited insight into the internal mechanisms underlying these effects. We present a mechanistic study of how race and ethnicity are represented and operationalized within LLMs. Using two publicly available datasets spanning toxicity-related generation and clinical narrative understanding tasks, we analyze three open-source models with a reproducible interpretability pipeline combining probing, neuron-level attribution, and targeted intervention. We find that demographic information is distributed across internal units with substantial cross-model variation. Although some units encode sensitive or stereotype-related associations from pretraining, identical demographic cues can induce qualitatively different behaviors. Interventions suppressing such neurons reduce bias but leave substantial residual effects, suggesting behavioral rather than representational change and motivating more systematic mitigation.
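The targeted intervention described in the abstract amounts to zeroing out selected hidden units and measuring how much the model's output shifts. A minimal sketch on a toy two-layer network (random made-up weights and hypothetical neuron indices, purely for illustration of the mechanic):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # input -> hidden weights (toy stand-in for an MLP layer)
W2 = rng.normal(size=(8, 2))   # hidden -> output weights

def forward(x, ablate=()):
    """Run the toy network, optionally zeroing (ablating) chosen hidden neurons."""
    h = np.maximum(x @ W1, 0.0)     # hidden activations (ReLU)
    h[..., list(ablate)] = 0.0      # targeted intervention: silence selected units
    return h @ W2

x = rng.normal(size=(4,))
base = forward(x)
ablated = forward(x, ablate={2, 5})             # hypothetical "key" neurons
delta = np.abs(base - ablated).sum()            # size of the behavioral shift
```

The same pattern is typically realized in a real transformer with forward hooks that overwrite activations at chosen layer/neuron positions; the residual effects the paper reports correspond to `delta` staying well above zero even after the identified neurons are silenced.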