$C^3$-Bench: The Things Real Disturbing LLM based Agent in Multi-Tasking

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing evaluations of LLM-based multi-task agents overlook the impact of tool dependencies, environmental feedback, and historical decisions on robustness. This paper introduces the first open-source benchmark explicitly designed for multi-task robustness, targeting three core challenges: tool dependency modeling, implicit information perception, and dynamic decision-path adaptation—thereby moving beyond conventional conversational evaluation paradigms. Methodologically, it integrates adversarial attack principles with single-variable attribution analysis to design fine-grained metrics, environment-feedback-driven data collection algorithms, and a standardized evaluation protocol. Key technical components include tool relational graphs, long-context perturbations, and strategy-switching stress tests. In evaluations of 49 state-of-the-art LLM agents, the benchmark systematically uncovers previously unreported structural vulnerabilities—in particular in tool dependency modeling, long-context reasoning, and high-frequency policy switching—providing the first empirical evidence of these robustness bottlenecks.
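
The summary mentions tool relational graphs as a probe for tool-dependency modeling. As a rough illustration only (the class, tool names, and dependency semantics below are hypothetical and not taken from $C^3$-Bench), such a graph could be used to check whether an agent's proposed call sequence respects prerequisite tools:

```python
# Toy tool relational graph: an edge A -> B means A must run before B.
# All names here are illustrative, not the benchmark's actual tools or API.
from collections import defaultdict

class ToolGraph:
    def __init__(self):
        self.requires = defaultdict(set)  # tool -> set of prerequisite tools

    def add_dependency(self, tool, prerequisite):
        self.requires[tool].add(prerequisite)

    def violations(self, call_sequence):
        """Return (tool, missing_prerequisite) pairs found in an agent's plan."""
        seen, bad = set(), []
        for tool in call_sequence:
            for prereq in self.requires[tool]:
                if prereq not in seen:
                    bad.append((tool, prereq))
            seen.add(tool)
        return bad

# Example: the agent tries to pay before booking.
graph = ToolGraph()
graph.add_dependency("book_flight", "search_flights")
graph.add_dependency("pay_order", "book_flight")
print(graph.violations(["search_flights", "pay_order", "book_flight"]))
# -> [('pay_order', 'book_flight')]
```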

📝 Abstract
Agents based on large language models leverage tools to modify environments, revolutionizing how AI interacts with the physical world. Unlike traditional NLP tasks that rely solely on historical dialogue for responses, these agents must consider more complex factors, such as inter-tool relationships, environmental feedback, and previous decisions, when making choices. Current research typically evaluates agents via multi-turn dialogues, but it overlooks the influence of these critical factors on agent behavior. To bridge this gap, we present $C^3$-Bench, an open-source, high-quality benchmark. This benchmark integrates attack concepts and applies univariate analysis to pinpoint key elements affecting agent robustness. Concretely, we design three challenges: navigating complex tool relationships, handling critical hidden information, and managing dynamic decision paths. Complementing these challenges, we introduce fine-grained metrics, innovative data collection algorithms, and reproducible evaluation methods. Extensive experiments are conducted on 49 mainstream agents, encompassing general fast-thinking, slow-thinking, and domain-specific models. We observe that agents have significant shortcomings in handling tool dependencies, long-context information dependencies, and frequent policy switching. In essence, $C^3$-Bench aims to expose model vulnerabilities through these challenges and drive research into the interpretability of agent performance. The benchmark is publicly available at https://github.com/yupeijei1997/C3-Bench.
Problem

Research questions and friction points this paper is trying to address.

Evaluate LLM agents' robustness in multi-tasking environments
Assess agent performance under tool dependencies and dynamic decisions
Expose model vulnerabilities through interpretable benchmark challenges
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open-source benchmark for agent robustness
Univariate analysis to identify key factors (see the sketch after this list)
Fine-grained metrics and evaluation methods
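
A minimal sketch of the univariate (single-variable) analysis idea named above: hold the tasks fixed, perturb exactly one factor per run, and attribute the resulting accuracy drop to that factor. The function name, factor names, and callables here are illustrative placeholders, not the benchmark's actual interface:

```python
# Hedged sketch of univariate attribution: vary one factor at a time and
# compare against an unperturbed baseline. `run_agent`, `is_correct`, and the
# factor perturbations are hypothetical placeholders.
from typing import Callable, Dict, List

def univariate_attribution(
    tasks: List[dict],
    factors: Dict[str, Callable[[dict], dict]],  # factor name -> perturbation
    run_agent: Callable[[dict], str],
    is_correct: Callable[[dict, str], bool],
) -> Dict[str, float]:
    """Accuracy drop attributable to each factor, measured one at a time."""
    baseline = sum(is_correct(t, run_agent(t)) for t in tasks) / len(tasks)
    drops = {}
    for name, perturb in factors.items():
        acc = sum(is_correct(t, run_agent(perturb(t))) for t in tasks) / len(tasks)
        drops[name] = baseline - acc
    return drops
```

In this framing, each of the three challenges would correspond to a perturbation family: rewiring tool dependencies, burying the key fact deep in a long context, or forcing a mid-task policy switch.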
Peijie Yu
Tencent HunYuan Team
Yifan Yang
Tencent HunYuan Team
Jinjian Li
Tencent HunYuan Team
Zelong Zhang
Tencent HunYuan Team
Haorui Wang
PhD student, Gatech
Machine Learning, Large Language Models, Decision Making, Uncertainty Quantification
Xiao Feng
Tencent HunYuan Team
Feng Zhang
Tencent HunYuan Team