Quantifying Data Contamination in Psychometric Evaluations of LLMs

📅 2025-10-08
🤖 AI Summary
This study is the first to systematically quantify data contamination risks in psychological assessment using large language models (LLMs). Addressing three contamination mechanisms—item memorization, assessment memorization, and target-score matching—we propose a multidimensional evaluation framework. Empirical analysis involves 21 mainstream LLMs and four canonical psychological scales (e.g., BFI-44, PVQ-40). Results reveal pervasive item-level memorization and score manipulation across widely used scales; certain models, especially after fine-tuning, generate responses that precisely match predefined personality profiles. Cross-model comparisons confirm the ubiquity of contamination effects and their strong dependence on scale design. These findings expose critical threats to construct validity in LLM-based psychological assessment, challenging the reliability of automated trait inference. The work establishes a methodological foundation for contamination mitigation, scale modernization, and development of trustworthy AI-driven psychometric evaluation protocols.

📝 Abstract
Recent studies apply psychometric questionnaires to Large Language Models (LLMs) to assess high-level psychological constructs such as values, personality, moral foundations, and dark traits. Although prior work has raised concerns about possible data contamination from psychometric inventories, which may threaten the reliability of such evaluations, there has been no systematic attempt to quantify the extent of this contamination. To address this gap, we propose a framework to systematically measure data contamination in psychometric evaluations of LLMs, evaluating three aspects: (1) item memorization, (2) evaluation memorization, and (3) target score matching. Applying this framework to 21 models from major families and four widely used psychometric inventories, we provide evidence that popular inventories such as the Big Five Inventory (BFI-44) and Portrait Values Questionnaire (PVQ-40) exhibit strong contamination, where models not only memorize items but can also adjust their responses to achieve specific target scores.
Problem

Research questions and friction points this paper is trying to address.

Quantifying data contamination in LLM psychometric evaluations
Measuring memorization and score manipulation in psychological assessments
Assessing reliability threats from contaminated psychometric inventory data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework measures data contamination in psychometric evaluations
Evaluates item memorization, evaluation memorization, target score matching
Models memorize items and adjust responses to target scores
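To make the item-memorization idea concrete, here is a minimal sketch of one plausible probe: show a model the opening words of an inventory item, hold out the tail, and measure how much of the held-out tail its continuation reproduces. The overlap metric and the BFI-style example item below are illustrative assumptions, not the paper's exact protocol, and the actual LLM call is omitted.

```python
# Illustrative probe for item-level memorization (assumption: not the paper's
# exact metric). A model is shown the first `prefix_words` words of an
# inventory item; a high-overlap continuation suggests verbatim memorization.

def memorization_score(reference_item: str, completion: str, prefix_words: int) -> float:
    """Fraction of the held-out tail of `reference_item` recovered in `completion`."""
    tail = reference_item.split()[prefix_words:]
    if not tail:
        return 0.0
    completed = set(completion.split())
    matched = sum(1 for word in tail if word in completed)
    return matched / len(tail)

# BFI-style example item; the first 6 words are the prompt, the rest is held out.
item = "I see myself as someone who tends to find fault with others"
verbatim = "tends to find fault with others"      # what a memorizing model might emit
paraphrase = "often criticizes people around me"  # a non-memorizing continuation

print(memorization_score(item, verbatim, 6))    # → 1.0
print(memorization_score(item, paraphrase, 6))  # → 0.0
```

In a full evaluation, the score would be averaged over all items of a scale (e.g., the 44 items of the BFI-44) and compared across models to separate memorized inventories from novel ones.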
Jongwook Han
PhD student @ Seoul National University
NLP, LLM alignment
Woojung Song
Graduate School of Data Science, Seoul National University
Jonggeun Lee
Graduate School of Data Science, Seoul National University
Yohan Jo
Seoul National University
Natural Language Processing, Agents, Computational Psychology, Reasoning