🤖 AI Summary
Problem: Evaluation of AI agents on enterprise CRM systems, particularly on complex, real-world workflows, lacks standardized, high-fidelity, and interpretable benchmarks.
Method: We introduce SCUBA, the first high-fidelity, explainable evaluation benchmark tailored for Salesforce. It comprises 300 realistic CRM workflow tasks derived from actual user scenarios, supporting multi-role collaboration, fine-grained behavioral assessment, and parallel execution. Built atop a Salesforce sandbox, SCUBA enables end-to-end evaluation of UI navigation, data manipulation, workflow automation, information retrieval, and fault diagnosis, compatible with both zero-shot and in-context learning paradigms.
Contribution/Results: Experiments show closed-source models achieve 39% task success in zero-shot settings, rising to 50% with in-context examples while reducing inference latency and cost by 13% and 16%, respectively. SCUBA is the first benchmark to systematically expose the substantial performance gap between open- and closed-source models on enterprise automation tasks, establishing a standardized infrastructure for CRM agent development and evaluation.
📝 Abstract
We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas: platform administrators, sales representatives, and service agents. The tasks test a range of enterprise-critical abilities, including enterprise software UI navigation, data manipulation, workflow automation, information retrieval, and troubleshooting. To ensure realism, SCUBA operates in Salesforce sandbox environments, with support for parallel execution and fine-grained evaluation metrics that capture milestone progress. We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings. We observe substantial performance gaps across agent design paradigms and between open-source and closed-source models. In the zero-shot setting, computer-use agents powered by open-source models that perform strongly on related benchmarks such as OSWorld achieve success rates below 5% on SCUBA, while methods built on closed-source models reach task success rates of up to 39%. In the demonstration-augmented setting, task success rates improve to 50% while time and cost drop by 13% and 16%, respectively. These findings highlight both the challenges of enterprise task automation and the promise of agentic solutions. By offering a realistic benchmark with interpretable evaluation, SCUBA aims to accelerate progress in building reliable computer-use agents for complex business software ecosystems.