🤖 AI Summary
Existing AI-agent benchmarks suffer from a scarcity of realistic commercial data, a lack of multi-turn persona-driven interaction, and inadequate compliance evaluation, so they fail to reflect the operational complexity of real enterprises. This paper introduces CRMArena-Pro, a comprehensive LLM-agent benchmark designed for authentic enterprise applications, covering 19 expert-validated tasks spanning sales, customer service, and CPQ (Configure-Price-Quote) processes in both Business-to-Business and Business-to-Customer scenarios. The benchmark combines three evaluation dimensions: (1) multi-turn, persona-driven interaction; (2) industry-process authenticity; and (3) quantitative confidentiality assessment, realized through CRM-aware task design, multi-role dialogue state tracking, automated confidentiality-violation detection, and multi-granularity success metrics. Experiments show that top-tier models achieve only about 58% single-turn success, dropping to roughly 35% in multi-turn settings. Workflow Execution exceeds 83% single-turn success, yet agents exhibit near-zero inherent confidentiality awareness, and while targeted prompting improves adherence, it often degrades task performance.
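To make the evaluation framework concrete, here is a minimal sketch (not the paper's actual implementation; the restricted field names and record format are assumptions for illustration) of two of the checks the summary describes: a naive automated confidentiality-violation detector and a multi-granularity success aggregation over per-task results.

```python
import re

# Assumed examples of confidential CRM fields an agent must not reveal;
# the real benchmark's restricted data is not specified here.
RESTRICTED_PATTERNS = [
    re.compile(r"\bdiscount[_ ]rate\b", re.IGNORECASE),
    re.compile(r"\binternal[_ ]cost\b", re.IGNORECASE),
]

def violates_confidentiality(agent_reply: str) -> bool:
    """Flag a reply that mentions any restricted field."""
    return any(p.search(agent_reply) for p in RESTRICTED_PATTERNS)

def success_rates(results):
    """Aggregate pass/fail records into per-skill and overall success rates.

    `results` is a list of dicts like {"skill": "workflow", "passed": True},
    a hypothetical record format for per-task outcomes.
    """
    by_skill = {}
    for r in results:
        by_skill.setdefault(r["skill"], []).append(r["passed"])
    per_skill = {skill: sum(v) / len(v) for skill, v in by_skill.items()}
    overall = sum(r["passed"] for r in results) / len(results)
    return per_skill, overall
```

For example, `success_rates` over two passing "workflow" tasks and one failing "service" task yields a per-skill breakdown (workflow 1.0, service 0.0) alongside the overall rate, mirroring how the benchmark can report both skill-level and aggregate success.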
📝 Abstract
While AI agents hold transformative potential in business, effective performance benchmarking is hindered by the scarcity of public, realistic business data on widely used platforms. Existing benchmarks often lack fidelity in their environments, data, and agent-user interactions, with limited coverage of diverse business scenarios and industries. To address these gaps, we introduce CRMArena-Pro, a novel benchmark for holistic, realistic assessment of LLM agents in diverse professional settings. CRMArena-Pro expands on CRMArena with nineteen expert-validated tasks across sales, service, and 'configure, price, and quote' processes, for both Business-to-Business and Business-to-Customer scenarios. It distinctively incorporates multi-turn interactions guided by diverse personas and robust confidentiality awareness assessments. Experiments reveal leading LLM agents achieve only around 58% single-turn success on CRMArena-Pro, with performance dropping significantly to approximately 35% in multi-turn settings. While Workflow Execution proves more tractable for top agents (over 83% single-turn success), other evaluated business skills present greater challenges. Furthermore, agents exhibit near-zero inherent confidentiality awareness; though targeted prompting can improve this, it often compromises task performance. These findings highlight a substantial gap between current LLM capabilities and enterprise demands, underscoring the need for advancements in multi-turn reasoning, confidentiality adherence, and versatile skill acquisition.