Feedback Forensics: A Toolkit to Measure AI Personality

📅 2025-09-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current AI evaluation lacks publicly available, transparent tools for measuring model personality, making personality issues hard to detect — issues that have already led to a model rollback over sycophancy and to models overfitting to feedback-based leaderboards. This paper introduces Feedback Forensics, an open-source toolkit for tracking AI personality changes. Using AI annotators to compare model responses, it analyses popular human feedback datasets (Chatbot Arena, MultiPref, PRISM) to quantify which personality traits human and AI feedback encourage. The toolkit ships with a Python API and an interactive web application that enable reproducible, interpretable analysis of personality tendencies. The authors first identify the traits encouraged in these feedback datasets, then measure how strongly popular models exhibit those traits. This work moves personality evaluation from opaque, implicit feedback rankings toward explicit, auditable analysis.

📝 Abstract
Some traits that make a "good" AI model are hard to describe upfront. For example, should responses be more polite or more casual? Such traits are sometimes summarized as model character or personality. Without a clear objective, conventional benchmarks based on automatic validation struggle to measure such traits. Evaluation methods using human feedback, such as Chatbot Arena, have emerged as a popular alternative. These methods infer "better" personality and other desirable traits implicitly by ranking multiple model responses relative to each other. Recent issues with model releases highlight limitations of these existing opaque evaluation approaches: a major model was rolled back over sycophantic personality issues, and models were observed overfitting to such feedback-based leaderboards. Despite these known issues, limited public tooling exists to explicitly evaluate model personality. We introduce Feedback Forensics: an open-source toolkit to track AI personality changes, both those encouraged by human (or AI) feedback, and those exhibited across AI models trained and evaluated on such feedback. Leveraging AI annotators, our toolkit enables investigating personality via a Python API and browser app. We demonstrate the toolkit's usefulness in two steps: (A) first we analyse the personality traits encouraged in popular human feedback datasets including Chatbot Arena, MultiPref and PRISM; and (B) then we use our toolkit to analyse how much popular models exhibit such traits. We release (1) our Feedback Forensics toolkit alongside (2) a web app tracking AI personality in popular models and feedback datasets as well as (3) the underlying annotation data at https://github.com/rdnfn/feedback-forensics.
Problem

Research questions and friction points this paper is trying to address.

Measuring AI personality traits without clear evaluation objectives
Addressing limitations of opaque human feedback evaluation methods
Providing tools to track personality changes across AI models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source toolkit tracks AI personality changes
Uses AI annotators for personality investigation
Provides Python API and browser app interface
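The core idea behind the annotator-based analysis can be sketched in a few lines: given pairwise preference data where an AI annotator has labelled whether each response exhibits a trait (e.g. "polite"), count how often the preferred response carries the trait. This is a minimal illustrative sketch, not the actual Feedback Forensics API — the function name and data layout are assumptions for illustration only.

```python
# Hypothetical sketch (not the real Feedback Forensics API): estimate how
# strongly a feedback dataset encourages a personality trait, given per-pair
# trait labels from an AI annotator.

def trait_preference_rate(pairs):
    """pairs: list of (chosen_has_trait, rejected_has_trait) booleans.

    Returns the fraction of trait-discriminating pairs (where exactly one
    response exhibits the trait) in which the preferred response is the one
    with the trait. 0.5 means the trait is neither encouraged nor
    discouraged by the feedback; above 0.5 means encouraged.
    """
    # Only pairs where the trait distinguishes the two responses are informative.
    discriminating = [(c, r) for c, r in pairs if c != r]
    if not discriminating:
        return 0.5  # trait never distinguishes the responses
    wins = sum(1 for c, _ in discriminating if c)
    return wins / len(discriminating)

# Example: annotator labels for a "polite" trait across four preference pairs.
pairs = [
    (True, False),   # chosen response polite, rejected not: trait "wins"
    (True, False),
    (False, True),   # trait "loses"
    (True, True),    # both polite: non-discriminating, ignored
]
print(trait_preference_rate(pairs))  # 2 of 3 discriminating pairs -> ~0.667
```

Aggregating such rates per trait and per dataset is one simple way to compare, for instance, what Chatbot Arena rankings reward versus what MultiPref or PRISM annotators reward.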