🤖 AI Summary
Prior LLM evaluations largely ignore user heterogeneity, assuming uniform preferences across individuals. This study investigates whether personality traits systematically influence users’ preferences for large language models (LLMs) in multi-turn human-AI collaboration.
Method: Grounded in the Keirsey Temperament Sorter, we conducted multi-round interactive experiments across four task domains—data analysis, creative writing, information retrieval, and writing assistance—comparing GPT-4 and Claude 3.5. Preferences were quantified via helpfulness ratings and enriched with qualitative feedback analyzed through sentiment-aware thematic coding.
Contribution/Results: Although overall helpfulness scores were similar for the two models, personality type significantly predicted preference: Rational temperament users favored GPT-4, whereas Idealist users preferred Claude 3.5, and the remaining temperaments showed task-dependent preferences. The study offers empirical evidence of temperament-based LLM preference patterns. The findings challenge the “one-size-fits-all” evaluation paradigm and motivate user-centered, human factors–informed approaches to personalized LLM adaptation and psychometrically grounded evaluation.
📝 Abstract
As Large Language Models (LLMs) are increasingly integrated into everyday workflows, where users shape outcomes through multi-turn collaboration, a critical question emerges: do users with different personality traits systematically prefer certain LLMs over others? We conducted a study with 32 participants evenly distributed across four Keirsey personality types, evaluating their interactions with GPT-4 and Claude 3.5 across four collaborative tasks: data analysis, creative writing, information retrieval, and writing assistance. Results revealed significant personality-driven preferences: Rationals strongly preferred GPT-4, particularly for goal-oriented tasks, while Idealists favored Claude 3.5, especially for creative and analytical tasks. Other personality types showed task-dependent preferences. Sentiment analysis of qualitative feedback confirmed these patterns. Notably, aggregate helpfulness ratings were similar across models, showing how personality-based analysis reveals LLM differences that traditional evaluations miss.
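The last point, that similar aggregate ratings can hide opposite group-level preferences, can be illustrated with a minimal Python sketch. The ratings below are invented placeholder values on an assumed 1-5 scale, not data from the study, and the function names are hypothetical; the Guardian and Artisan rows are set equal purely for simplicity.

```python
from statistics import mean

# Hypothetical helpfulness ratings (assumed 1-5 scale).
# These are illustrative placeholders, NOT results from the study.
ratings = [
    ("Rational", "GPT-4", 5), ("Rational", "Claude 3.5", 3),
    ("Idealist", "GPT-4", 3), ("Idealist", "Claude 3.5", 5),
    ("Guardian", "GPT-4", 4), ("Guardian", "Claude 3.5", 4),
    ("Artisan", "GPT-4", 4), ("Artisan", "Claude 3.5", 4),
]


def mean_by_model(rows):
    """Average rating per model over the given rows."""
    by_model = {}
    for _, model, score in rows:
        by_model.setdefault(model, []).append(score)
    return {model: mean(scores) for model, scores in by_model.items()}


def gap_by_temperament(rows):
    """Mean GPT-4 rating minus mean Claude 3.5 rating, per temperament."""
    gaps = {}
    for temperament in sorted({t for t, _, _ in rows}):
        means = mean_by_model([r for r in rows if r[0] == temperament])
        gaps[temperament] = means["GPT-4"] - means["Claude 3.5"]
    return gaps


overall = mean_by_model(ratings)   # aggregate view: both models tie
gaps = gap_by_temperament(ratings)  # per-temperament view: opposite signs
print(overall)
print(gaps)
```

With these placeholder numbers the two models tie in the aggregate, while the Rational and Idealist gaps point in opposite directions. An evaluation that reports only the overall mean would miss exactly this kind of pattern.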