Measure what Matters: Psychometric Evaluation of AI with Situational Judgment Tests

📅 2025-10-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing AI psychometrics mostly repurposes human personality inventories (e.g., Big Five, HEXACO) or ad hoc role definitions, which distorts behavior and adapts poorly to specific domains. To address this, we propose a Situational Judgment Test (SJT) framework designed specifically for AI systems. It draws on industrial-organizational psychology and personality theory to construct fine-grained virtual personas with social and emotional functions. The method combines demographic prior modeling and autobiographical narrative generation with Pydantic-based structured generation, aiming for interpretable and reproducible persona modeling and behavioral analysis. We instantiate the framework in a law enforcement assistant scenario and build a large-scale benchmark of 8,500 virtual personas, 4,000 situational judgment items, and 300,000 AI responses, spanning eight persona archetypes and eleven competency attributes. The dataset and all code will be released publicly.

📝 Abstract

AI psychometrics evaluates AI systems in roles that traditionally require emotional judgment and ethical consideration. Prior work often reuses human trait inventories (Big Five, HEXACO) or ad hoc personas, limiting behavioral realism and domain relevance. We propose a framework that (1) uses situational judgment tests (SJTs) built from realistic scenarios to probe domain-specific competencies; (2) integrates industrial-organizational and personality psychology to design sophisticated personas that include behavioral and psychological descriptors, life history, and social and emotional functions; and (3) employs structured generation with population demographic priors and memoir-inspired narratives, encoded with Pydantic schemas. In a law enforcement assistant case study, we construct a rich dataset of personas drawn from 8 persona archetypes and SJTs covering 11 attributes, and analyze behaviors across subpopulation and scenario slices. The dataset spans 8,500 personas, 4,000 SJTs, and 300,000 responses. We will release the dataset and all code to the public.
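To make point (3) of the abstract concrete, here is a minimal sketch of how a persona might be encoded as a Pydantic schema and populated from demographic priors. It is not the authors' schema: the field names, the `AGE_BAND_PRIOR` weights, the placeholder archetype names, and the helpers `sample_demographics` and `build_persona` are all assumptions made for illustration, and the narrative string stands in for text that a language model would generate.

```python
import random
from typing import List, Literal

from pydantic import BaseModel, Field, ValidationError

# Hypothetical demographic prior (illustrative weights, not taken from the paper).
AGE_BAND_PRIOR = {"18-29": 0.20, "30-44": 0.35, "45-64": 0.35, "65+": 0.10}

# Placeholder names: the paper reports 8 archetypes but this page does not list them.
ARCHETYPES = ["archetype_1", "archetype_2"]


class Persona(BaseModel):
    """Assumed persona record; the real schema may differ."""

    persona_id: str
    archetype: str
    age_band: Literal["18-29", "30-44", "45-64", "65+"]
    traits: List[str] = Field(default_factory=list)  # behavioral descriptors
    life_history: str = Field(min_length=1)          # memoir-style narrative


def sample_demographics(rng: random.Random) -> dict:
    """Draw demographic fields from the prior instead of letting a model pick them."""
    bands = list(AGE_BAND_PRIOR)
    weights = list(AGE_BAND_PRIOR.values())
    return {
        "age_band": rng.choices(bands, weights=weights, k=1)[0],
        "archetype": rng.choice(ARCHETYPES),
    }


def build_persona(persona_id: str, narrative: str, traits: List[str],
                  rng: random.Random) -> Persona:
    """Combine sampled demographics with generated text and validate the result."""
    demographics = sample_demographics(rng)
    return Persona(persona_id=persona_id, life_history=narrative,
                   traits=traits, **demographics)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sampled personas are reproducible
    persona = build_persona("p-001", "Placeholder narrative.", ["calm"], rng)
    print(persona)
    try:
        # An out-of-vocabulary age band is rejected by the schema.
        Persona(persona_id="p-002", archetype="archetype_1",
                age_band="12-17", life_history="Placeholder narrative.")
    except ValidationError as err:
        print("rejected:", type(err).__name__)
```

Sampling demographics outside the generator and validating the assembled record is one way to keep the persona population tied to the intended priors, and seeding the random generator keeps the draw reproducible.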

Problem

Research questions and friction points this paper is trying to address.

Evaluating AI systems in emotionally demanding professional roles

Overcoming limitations of generic personality tests for AI assessment

Creating realistic behavioral personas for domain-specific competency testing

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses situational judgment tests for domain competencies

Integrates psychology to design sophisticated personas

Employs structured generation with demographic priors
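The SJT-based evaluation listed above can be sketched as a small scoring and slicing routine. This is an illustration under stated assumptions, not the paper's scoring method: the per-option effectiveness key (`option_scores`), the attribute name, the group labels, and the function `mean_score_by_slice` are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Tuple


@dataclass
class SJTItem:
    item_id: str
    attribute: str                   # competency the item is meant to probe
    scenario: str
    option_scores: Dict[str, float]  # assumed effectiveness key per response option


@dataclass
class Response:
    item_id: str
    persona_group: str               # subpopulation slice, e.g. an archetype
    chosen_option: str


def mean_score_by_slice(items: List[SJTItem],
                        responses: List[Response]) -> Dict[Tuple[str, str], float]:
    """Average keyed score for each (persona group, attribute) slice."""
    items_by_id = {item.item_id: item for item in items}
    buckets: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for response in responses:
        item = items_by_id[response.item_id]
        score = item.option_scores[response.chosen_option]
        buckets[(response.persona_group, item.attribute)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


if __name__ == "__main__":
    items = [SJTItem("sjt-1", "de-escalation", "Placeholder scenario.",
                     {"A": 1.0, "B": 0.0})]
    responses = [
        Response("sjt-1", "group_1", "A"),
        Response("sjt-1", "group_1", "A"),
        Response("sjt-1", "group_2", "A"),
        Response("sjt-1", "group_2", "B"),
    ]
    for key, value in sorted(mean_score_by_slice(items, responses).items()):
        print(key, value)
```

Keying results by (group, attribute) pairs lets the same table be read along either axis, which matches the subpopulation and scenario slices the abstract describes.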

🔎 Similar Papers

Performance and Metacognition Disconnect when Reasoning in Human-AI Interaction

2024-09-25 | Citations: 0

Position: AI Evaluation Should Learn from How We Test Humans

2023-06-18 | Citations: 21
