WirelessBench: A Tolerance-Aware LLM Agent Benchmark for Wireless Network Intelligence

📅 2026-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses critical limitations in existing benchmarks for wireless network intelligence, which overlook engineering risks and fail to evaluate catastrophic errors such as cascading failures and unit confusion (e.g., dB vs. dBm), while lacking metrics for fault tolerance and tool-use capability. To bridge this gap, we propose the first tolerance-aware, tool-integrated LLM agent benchmark tailored for wireless networks, featuring a three-tier cognitive architecture encompassing domain-knowledge reasoning, intent-driven resource allocation, and proactive multi-step decision-making in mobile scenarios. Our framework enables fine-grained diagnosis of catastrophic errors and reasoning breakdowns through a fault-tolerance scoring mechanism, mandatory tool-invocation tasks, 3GPP-compliant ray-tracing queries, and traceable chain-of-thought annotations. Experiments demonstrate that tool-integrated agents achieve 84.64% accuracy, 16.64 percentage points higher than direct prompting, and reveal that 23% of errors are catastrophic failures invisible to conventional exact-match metrics, enabling actionable error attribution across four distinct categories.

📝 Abstract
LLM agents are emerging as a key enabler for autonomous wireless network management. Reliably deploying them, however, demands benchmarks that reflect real engineering risk. Existing wireless benchmarks evaluate single isolated capabilities and treat all errors uniformly, missing both cascaded-chain failures and catastrophic unit confusions (e.g., dB vs. dBm). We present WirelessBench, the first tolerance-aware, tool-integrated benchmark for LLM-based wireless agents. WirelessBench is organized as a three-tier cognitive hierarchy: domain knowledge reasoning (WCHW, 1,392 items), intent-driven resource allocation (WCNS, 1,000 items), and proactive multi-step decisions under mobility (WCMSA, 1,000 items). Moreover, WirelessBench is built on three design principles: (i) tolerance-aware scoring with catastrophic-error detection; (ii) tool-necessary tasks requiring a 3GPP-compliant ray-tracing query for channel quality; and (iii) Chain-of-Thought (CoT)-traceable items, where every benchmark item ships with a complete CoT trajectory, enabling fine-grained diagnosis of where in the reasoning chain an agent fails. Our numerical results show that the direct-prompting model (GPT-4o) scores 68%, trailing a tool-integrated agent (84.64%) by 16.64 pp; 23% of errors are catastrophic failures invisible to exact-match metrics. More importantly, the hierarchy decomposes errors into four actionable diagnostic categories that flat evaluation cannot reveal. Code and data: https://wirelessbench.github.io/.
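The abstract's first design principle, tolerance-aware scoring with catastrophic-error detection, can be sketched in a few lines. The function name `score_power_answer`, the tolerance thresholds, and the offset table below are illustrative assumptions, not the paper's released code; since a dB-vs-dBm confusion has no single fixed offset, the sketch flags the related fixed-offset case of dBm-vs-dBW confusion (dBm = dBW + 30).

```python
# Hypothetical sketch of tolerance-aware scoring with catastrophic-error
# detection, in the spirit of the benchmark's description. Names, thresholds,
# and the offset table are illustrative assumptions, not the paper's code.

# Known unit-confusion signatures: an error matching one of these offsets
# suggests the agent answered in the wrong unit, not a small numeric slip.
CATASTROPHIC_OFFSETS_DB = {
    "dBm-vs-dBW": 30.0,  # dBm = dBW + 30
}

def score_power_answer(pred_db, truth_db, rel_tol=0.05, abs_tol=0.5):
    """Return (score, label) for a predicted power value on a dB scale.

    score: 1.0 if within tolerance, else 0.0.
    label: 'correct', 'minor', or 'catastrophic:<signature>'.
    """
    err = abs(pred_db - truth_db)
    # Tolerance band: accept small absolute or relative deviations.
    if err <= max(abs_tol, rel_tol * abs(truth_db)):
        return 1.0, "correct"
    # Errors matching a known unit-confusion offset are catastrophic:
    # they indicate a dangerous misunderstanding, not imprecision.
    for name, offset in CATASTROPHIC_OFFSETS_DB.items():
        if abs(err - offset) <= abs_tol:
            return 0.0, f"catastrophic:{name}"
    return 0.0, "minor"
```

For example, a prediction of -60.0 against a ground truth of -30.0 dBm is off by exactly 30 dB and would be labeled `catastrophic:dBm-vs-dBW`, whereas -29.8 falls inside the tolerance band and scores 1.0. Exact-match scoring would treat both wrong answers identically, which is precisely the blind spot the benchmark targets.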
Problem

Research questions and friction points this paper is trying to address.

wireless network intelligence
LLM agent benchmark
catastrophic errors
tolerance-aware evaluation
engineering risk
Innovation

Methods, ideas, or system contributions that make the work stand out.

tolerance-aware benchmark
LLM agent
wireless network intelligence
tool-integrated reasoning
Chain-of-Thought tracing
Jingwen Tong
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Fang Liu
Computer Science and Engineering, Nanjing University of Science and Technology
Deep learning · Image Processing · Remote Sensing · SAR · PolSAR
Linkai Xv
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Shiliang Lu
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Kangqi Li
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Yiqian Zhang
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Yijie Song
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Zeyang Xue
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Jun Zhang
Professor, Hong Kong University of Science and Technology, IEEE Fellow
Mobile Edge Computing · Edge AI · Wireless Communications · GenAI