🤖 AI Summary
This study investigates whether human psychometric instruments, such as gender/racial bias and moral judgment scales, remain applicable and ecologically valid when adapted for evaluating large language models (LLMs). Using an assessment framework that combines multi-round item design, prompt variation testing, convergent validity analysis, and behavioral alignment with downstream tasks, the authors systematically evaluate the reliability and validity of 12 widely used psychological tests across LLMs. Results indicate moderate internal consistency (Cronbach's α ≈ 0.65–0.78) but critically low ecological validity: model psychometric scores correlate weakly, or even negatively, with actual discriminatory outputs and fairness-related decision-making in realistic scenarios. The core contribution is the first proposal and empirical validation of an ecological validity evaluation paradigm tailored to LLMs, demonstrating fundamental limitations in directly transplanting human-centered scales. These findings provide empirical grounding for theoretical reconceptualization and methodological innovation in AI-oriented psychological assessment.
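As an illustration of the reliability analysis described above, the sketch below computes Cronbach's α from a matrix of Likert-scored model responses. This is a minimal, self-contained example, not the study's code: the matrix shape, scoring, and all numbers are hypothetical.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for an (n_administrations x n_items) score matrix."""
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of summed totals
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical data: 20 prompt variations (rows) x 8 scale items (columns),
# each cell a Likert-scored LLM response on a 1-5 scale.
rng = np.random.default_rng(0)
responses = rng.integers(1, 6, size=(20, 8)).astype(float)
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```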
📝 Abstract
Psychometric tests are increasingly used to assess psychological constructs in large language models (LLMs). However, it remains unclear whether these tests -- originally developed for humans -- yield meaningful results when applied to LLMs. In this study, we systematically evaluate the reliability and validity of human psychometric tests for three constructs: sexism, racism, and morality. We find moderate reliability across multiple item and prompt variations. Validity is evaluated through both convergent approaches (i.e., testing theory-based inter-test correlations) and ecological approaches (i.e., testing the alignment between test scores and behavior in real-world downstream tasks). Crucially, we find that psychometric test scores do not align with, and in some cases even negatively correlate with, model behavior in downstream tasks, indicating low ecological validity. Our results highlight that systematic evaluation of psychometric tests is essential before interpreting their scores. They also suggest that psychometric tests designed for humans cannot be applied directly to LLMs without adaptation.
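To make the ecological validity check concrete: one common operationalization is to correlate per-model psychometric scores with per-model behavior rates in a downstream task. The sketch below uses a rank correlation for this; all scores and rates are invented for illustration and do not come from the paper.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-model data: normalized sexism-scale scores and the rate
# of discriminatory outputs the same models produce in a downstream task.
test_scores = np.array([0.42, 0.55, 0.31, 0.67, 0.49])
downstream_bias_rate = np.array([0.18, 0.09, 0.22, 0.11, 0.15])

rho, p_value = spearmanr(test_scores, downstream_bias_rate)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
# A weak or negative rho means the psychometric score fails to predict
# actual downstream behavior -- i.e., low ecological validity.
```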