🤖 AI Summary
Human evaluation of agentic AI shopping assistants is labor-intensive and difficult to scale. Method: This paper presents the first quantitative assessment of LLM-based agents acting as digital twins that simulate authentic multi-turn user interactions, integrating real-user trajectory collection, fine-grained persona modeling, and LLM agent simulation, evaluated via pairwise behavioral-consistency analysis. Results: Although the LLM agents explore more broadly than human users, they exhibit high fidelity to them in decision paths, interaction pacing, and feedback content (mean behavioral similarity = 0.82) and reproduce 92% of key design insights. This work establishes the validity and scalability of LLM-based digital twins for automated AI system evaluation and provides the first empirical benchmark and methodological framework for behavior-fidelity-driven evaluation.
📝 Abstract
Agentic AI systems that execute tasks through natural language are emerging, such as Copilot for coding and Amazon Rufus for shopping. Evaluating these systems is challenging, as their rapid evolution outpaces traditional human evaluation. Researchers have proposed LLM agents that simulate participants as digital twins, but it remains unclear how faithfully a digital twin can represent a specific customer in multi-turn interaction with an agentic AI system. In this paper, we recruited 40 human participants to shop with Amazon Rufus, collected their personas, interaction traces, and UX feedback, and then created digital twins to repeat the same tasks. Pairwise comparison of human and digital-twin traces shows that while the agents often explored more diverse options, their action patterns aligned with the humans' and yielded similar design feedback. This study is the first to quantify how closely LLM agents can mirror human multi-turn interaction with an agentic AI system, highlighting their potential for scalable evaluation.
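To make the pairwise trace comparison concrete, one simple way to score behavioral consistency between a human trace and its digital-twin trace is to treat each trace as a sequence of action labels and compute a normalized edit-distance similarity. This is a minimal sketch under that assumption; the paper's actual similarity metric, action vocabulary, and function names here are illustrative, not the authors' implementation.

```python
def levenshtein(a, b):
    """Edit distance between two action sequences (standard DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def behavioral_similarity(human_trace, twin_trace):
    """1.0 = identical action sequences, 0.0 = entirely different."""
    longest = max(len(human_trace), len(twin_trace), 1)
    return 1.0 - levenshtein(human_trace, twin_trace) / longest

# Hypothetical traces: the twin reorders two exploratory actions.
human = ["search", "filter", "click_item", "ask_followup", "purchase"]
twin  = ["search", "click_item", "filter", "ask_followup", "purchase"]
print(round(behavioral_similarity(human, twin), 2))  # → 0.6
```

Averaging this score over all human/twin pairs would yield a single behavioral-similarity figure comparable in spirit to the 0.82 mean reported above, though the paper may weight decision paths, pacing, and feedback content differently.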