CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments

📅 2026-03-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of large language model (LLM) agents predominantly rely on synthetic environments, which fail to capture the diversity, unpredictability, and stringent efficiency demands of real-world cloud service customer requests. This work proposes CirrusBench, an evaluation framework grounded in authentic cloud service tickets that preserves the multi-turn logical chains and tool-call dependencies inherent in actual workflows. It introduces customer-centric metrics, such as the Normalized Efficiency Index and Multi-Turn Latency, to assess agent utility in terms of both service quality and resolution efficiency. Experimental results show that, despite their strong reasoning capabilities, state-of-the-art models frequently struggle with complex, realistic multi-turn tasks and fall short of the efficiency standards required for customer service.
📝 Abstract
The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction. However, existing benchmarks for LLM-based agents largely rely on synthetic environments that fail to capture the diversity and unpredictability of authentic customer inputs and often ignore the resolution efficiency essential for real-world deployment. To bridge this gap, we introduce CirrusBench, a novel evaluation framework built on real-world data from authentic cloud service tickets. CirrusBench preserves the intricate multi-turn logical chains and realistic tool dependencies inherent to technical service environments. Moving beyond execution correctness, we introduce novel Customer-Centric metrics to define agent success, using measures such as the Normalized Efficiency Index and Multi-Turn Latency to quantify service quality and explicitly measure resolution efficiency. Experiments using our framework reveal that while state-of-the-art models demonstrate strong reasoning capabilities, they frequently struggle in complex, realistic multi-turn tasks and fail to meet the high-efficiency standards required for customer service. These findings highlight critical directions for the future development of LLM-based agents in practical technical service applications. The CirrusBench evaluation framework is released at: https://github.com/CirrusAI
Problem

Research questions and friction points this paper is trying to address.

LLM-based agents
cloud service
evaluation benchmark
resolution efficiency
real-world data
Innovation

Methods, ideas, or system contributions that make the work stand out.

CirrusBench
LLM-based agents
real-world evaluation
resolution efficiency
customer-centric metrics
Yi Yu
Graduate School of Advanced Science and Engineering at Hiroshima University
Multimodal learning, Generative modeling, Multimedia, AI Music
Guangquan Hu
Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
Chenghuang Shen
Alibaba Group, Hangzhou, China
Xingyan Liu
Alibaba Group, Hangzhou, China
Jing Gu
School of Mathematics and Sciences, Fudan University, Shanghai, China
Hangyi Sun
School of Mathematics and Sciences, Fudan University, Shanghai, China
Junzhuo Ma
School of Mathematics and Sciences, Fudan University, Shanghai, China
Weiting Liu
Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
Jianfeng Liu
Alibaba Group, Hangzhou, China
Mingyue Pu
Alibaba Group, Hangzhou, China
Yu Wang
Alibaba
Computer science, Mathematics
Zhengdong Xiao
Alibaba Group, Hangzhou, China
Rui Xie
Alibaba Group, Hangzhou, China
Longjiu Luo
Alibaba Group, Hangzhou, China
Qianrong Wang
Alibaba Group, Hangzhou, China
Gurong Cui
Alibaba Group, Hangzhou, China
Honglin Qiao
Alibaba Group, Hangzhou, China
Wenlian Lu
Professor of Mathematics, Fudan University
Neural Networks, Complex Networks, Dynamical Systems