LLM Agent Meets Agentic AI: Can LLM Agents Simulate Customers to Evaluate Agentic-AI-based Shopping Assistants?

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Human evaluation of agentic AI shopping assistants is labor-intensive and difficult to scale. Method: This paper introduces the first quantitative assessment of LLM-based agents as digital twins simulating authentic multi-turn user interactions, integrating real-user trajectory collection, fine-grained persona modeling, and LLM agent simulation, evaluated via pairwise behavioral consistency analysis. Results: Although the agents explore more broadly than humans, they track human users closely in decision paths, interaction pacing, and feedback content (mean behavioral similarity = 0.82) and reproduce 92% of key design insights. This work establishes the validity and scalability of LLM-based digital twins for automated AI system evaluation and provides the first empirical benchmark and methodological framework for behavior-fidelity-driven evaluation.
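The summary does not spell out how pairwise behavioral consistency is computed. One plausible way to score it, sketched below purely for illustration, is a normalized edit-distance similarity between a human's action trace and the digital twin's trace over the same task; the metric, function names, and action labels here are assumptions, not the paper's actual method.

```python
def edit_distance(a, b):
    """Levenshtein distance between two action sequences (single-row DP)."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                        # deletion
                        dp[j - 1] + 1,                    # insertion
                        prev + (a[i - 1] != b[j - 1]))    # substitution
            prev = cur
    return dp[n]

def trace_similarity(human, twin):
    """1 - normalized edit distance: 1.0 means identical action traces."""
    if not human and not twin:
        return 1.0
    return 1.0 - edit_distance(human, twin) / max(len(human), len(twin))

# Hypothetical action traces from one shopping session
human = ["search", "filter", "view_item", "ask_followup", "purchase"]
twin = ["search", "view_item", "compare", "ask_followup", "purchase"]
print(trace_similarity(human, twin))
```

Averaging this score over all human/twin pairs would yield a single behavioral-similarity figure comparable in spirit to the 0.82 reported above.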

📝 Abstract
Agentic AI is emerging, capable of executing tasks through natural language, such as Copilot for coding or Amazon Rufus for shopping. Evaluating these systems is challenging, as their rapid evolution outpaces traditional human evaluation. Researchers have proposed LLM Agents to simulate participants as digital twins, but it remains unclear to what extent a digital twin can represent a specific customer in multi-turn interaction with an agentic AI system. In this paper, we recruited 40 human participants to shop with Amazon Rufus, collected their personas, interaction traces, and UX feedback, and then created digital twins to repeat the task. Pairwise comparison of human and digital-twin traces shows that while agents often explored more diverse choices, their action patterns aligned with humans and yielded similar design feedback. This study is the first to quantify how closely LLM agents can mirror human multi-turn interaction with an agentic AI system, highlighting their potential for scalable evaluation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating agentic AI systems using scalable LLM-based customer simulations
Quantifying how well digital twins replicate human multi-turn interactions
Assessing LLM agents' ability to provide valid UX design feedback
Innovation

Methods, ideas, or system contributions that make the work stand out.

Using LLM agents to simulate customer interactions
Comparing human and agent traces for evaluation
Quantifying agent-human alignment in multi-turn tasks