CRMArena-Pro: Holistic Assessment of LLM Agents Across Diverse Business Scenarios and Interactions

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing AI agent benchmarks suffer from insufficient real-world commercial data, a lack of multi-turn persona-driven interactions, and inadequate compliance evaluation, and thus fail to reflect enterprise operational complexity. This paper introduces the first comprehensive LLM-agent benchmark explicitly designed for authentic enterprise applications, covering 19 expert-validated tasks across B2B/B2C sales, customer service, and CPQ (Configure-Price-Quote) processes. The authors propose a three-dimensional evaluation framework integrating (1) multi-turn persona-driven interaction, (2) industry-process authenticity, and (3) quantitative confidentiality assessment, featuring CRM-aware task design, multi-role dialogue state tracking, automated confidentiality-violation detection, and multi-granularity success metrics. Experiments reveal that top-tier models achieve only about 58% single-turn success, dropping sharply to roughly 35% in multi-turn settings; Workflow Execution exceeds 83% single-turn success, yet inherent confidentiality awareness is near zero, and prompting agents to protect confidentiality often degrades task performance.
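The paper's exact evaluation code is not reproduced here, but the metrics it describes (single-turn vs. multi-turn success, plus a confidentiality-violation rate) can be illustrated with a minimal sketch. All names below (`Episode`, `success_rate`, the field names) are hypothetical, invented for this example:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    task: str                   # e.g. "workflow_execution"
    mode: str                   # "single_turn" or "multi_turn"
    success: bool               # did the agent complete the task?
    leaked_confidential: bool   # flagged by an external violation detector

def success_rate(episodes: list[Episode], mode: str) -> float:
    """Fraction of episodes in the given interaction mode that succeeded."""
    runs = [e for e in episodes if e.mode == mode]
    return sum(e.success for e in runs) / len(runs) if runs else 0.0

def confidentiality_violation_rate(episodes: list[Episode]) -> float:
    """Fraction of all episodes in which confidential data was disclosed."""
    if not episodes:
        return 0.0
    return sum(e.leaked_confidential for e in episodes) / len(episodes)

# Toy run mirroring the paper's headline pattern: multi-turn success
# lags single-turn success, and some episodes leak confidential data.
log = [
    Episode("lead_routing", "single_turn", True, False),
    Episode("lead_routing", "single_turn", False, False),
    Episode("quote_approval", "multi_turn", False, True),
]
print(success_rate(log, "single_turn"))          # 0.5
print(success_rate(log, "multi_turn"))           # 0.0
print(confidentiality_violation_rate(log))       # ~0.333
```

In the actual benchmark, the violation flag would come from an automated detector applied to the agent's transcript rather than from manual labels.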

📝 Abstract
While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.
Problem

Research questions and friction points this paper is trying to address.

Lack of realistic business data for AI agent benchmarking
Insufficient coverage of diverse business scenarios and industries
Poor agent performance in multi-turn interactions and near-zero confidentiality awareness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-validated tasks across diverse business scenarios
Multi-turn interactions guided by diverse personas
Robust confidentiality awareness assessments
Kung-Hsiang Huang
Salesforce AI Research
Akshara Prabhakar
Salesforce AI Research
Onkar Thorat
Salesforce AI Research
Divyansh Agarwal
Salesforce AI Research
Prafulla Kumar Choubey
Salesforce AI Research
Natural Language Processing, Machine Learning
Yixin Mao
Salesforce AI Research
Silvio Savarese
Associate Professor of Computer Science at Stanford University
Computer Vision
Caiming Xiong
Salesforce Research
Machine Learning, NLP, Computer Vision, Multimedia, Data Mining
Chien-Sheng Wu
Salesforce AI Research