🤖 AI Summary
This work addresses the lack of systematic evaluation of large language model–driven persona agents' ability to maintain consistency across multi-turn interactions, a gap that often leads to logical contradictions and factual errors. The authors propose PICon, a framework that uses an interrogation-based methodology to construct logically chained multi-turn questions, enabling systematic assessment of agent credibility along three dimensions: internal consistency, external consistency, and retest consistency. Through controlled experiments involving seven groups of state-of-the-art persona agents and 63 human participants, the study finds that even systems previously reported as highly consistent significantly underperform humans, frequently exhibiting logical inconsistencies and evasive behaviors, thereby exposing critical limitations in current approaches.
📝 Abstract
Large language model (LLM)-based persona agents are rapidly being adopted as scalable proxies for human participants across diverse domains. Yet there is no systematic method for verifying whether a persona agent's responses remain free of contradictions and factual inaccuracies throughout an interaction. A principle from interrogation methodology offers a lens: however elaborate a fabricated identity may be, systematic interrogation will expose its contradictions. We apply this principle to propose PICon, an evaluation framework that probes persona agents through logically chained multi-turn questioning. PICon evaluates consistency along three core dimensions: internal consistency (freedom from self-contradiction), external consistency (alignment with real-world facts), and retest consistency (stability under repetition). Evaluating seven groups of persona agents alongside 63 real human participants, we find that even systems previously reported as highly consistent fail to meet the human baseline on all three dimensions, revealing contradictions and evasive responses under chained questioning. This work provides both a conceptual foundation and a practical methodology for evaluating persona agents before trusting them as substitutes for human participants. We provide the source code and an interactive demo at: https://kaist-edlab.github.io/picon/