Judging with Personality and Confidence: A Study on Personality-Conditioned LLM Relevance Assessment

πŸ“… 2026-01-05
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ€– AI Summary
This study investigates how simulating the Big Five personality traits affects the accuracy and confidence calibration of relevance judgments made by large language models (LLMs) in information retrieval tasks. By using prompt engineering to induce distinct personality profiles, the authors systematically evaluate the models' judgments and self-reported confidence across multiple test collections. Their findings reveal, for the first time, that personality traits modulate both relevance assessment fidelity and confidence bias: low agreeableness yields judgments more aligned with human annotations, while low conscientiousness effectively mitigates both over- and under-confidence. Furthermore, feeding personality-conditioned relevance scores and confidence scores as complementary features into a random forest estimator surpasses the best single-personality condition on TREC DL 2021, demonstrating gains in evaluation effectiveness even with limited training data.

πŸ“ Abstract
Recent studies have shown that prompting can enable large language models (LLMs) to simulate specific personality traits and produce behaviors that align with those traits. However, there is limited understanding of how these simulated personalities influence critical web search decisions, specifically relevance assessment. Moreover, few studies have examined how simulated personalities impact confidence calibration, specifically the tendencies toward overconfidence or underconfidence. This gap exists even though psychological literature suggests these biases are trait-specific, often linking high extraversion to overconfidence and high neuroticism to underconfidence. To address this gap, we conducted a comprehensive study evaluating multiple LLMs, including commercial models and open-source models, prompted to simulate Big Five personality traits. We tested these models across three test collections (TREC DL 2019, TREC DL 2020, and LLMJudge), collecting two key outputs for each query-document pair: a relevance judgment and a self-reported confidence score. The findings show that personalities such as low agreeableness consistently align more closely with human labels than the unprompted condition. Additionally, low conscientiousness performs well in balancing the suppression of both overconfidence and underconfidence. We also observe that relevance scores and confidence distributions vary systematically across different personalities. Based on the above findings, we incorporate personality-conditioned scores and confidence as features in a random forest classifier. This approach achieves performance that surpasses the best single-personality condition on a new dataset (TREC DL 2021), even with limited training data. These findings highlight that personality-derived confidence offers a complementary predictive signal, paving the way for more reliable and human-aligned LLM evaluators.
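The abstract's final contribution can be sketched in code: treating each personality condition's relevance score and self-reported confidence as features for a random forest classifier. This is a minimal illustrative sketch, not the authors' implementation; the number of personality conditions, the feature layout, the binary labels, and all data below are assumptions for demonstration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical setup: 5 personality conditions, each contributing a graded
# relevance score (0-3) and a self-reported confidence (0-1) per
# query-document pair.
n_pairs, n_personalities = 200, 5
relevance_scores = rng.integers(0, 4, size=(n_pairs, n_personalities))
confidences = rng.random((n_pairs, n_personalities))
X = np.hstack([relevance_scores, confidences])  # (200, 10) feature matrix

# Stand-in human relevance labels (binarized for simplicity).
y = rng.integers(0, 2, size=n_pairs)

# Small training split, mirroring the paper's limited-training-data setting.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X[:150], y[:150])
preds = clf.predict(X[150:])  # one predicted judgment per held-out pair
```

The design choice the paper motivates is that the confidence columns carry signal complementary to the relevance columns, so the ensemble can learn which personality conditions to trust on which kinds of pairs rather than committing to a single best-performing persona.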
Problem

Research questions and friction points this paper is trying to address.

personality-conditioned LLM
relevance assessment
confidence calibration
Big Five personality traits
overconfidence
Innovation

Methods, ideas, or system contributions that make the work stand out.

personality-conditioned LLM
relevance assessment
confidence calibration
Big Five personality traits
human-aligned evaluation
Nuo Chen
The Hong Kong Polytechnic University, HK, China
Hanpei Fang
Waseda University, Japan
Piaohong Wang
City University of Hong Kong, HK, China
Jiqun Liu
The University of Oklahoma, OK, USA
Tetsuya Sakai
Waseda University
information retrieval, interaction, natural language processing, social good
Xiao-Ming Wu
Associate Professor, The Hong Kong Polytechnic University
Artificial Intelligence