WirelessBench: A Tolerance-Aware LLM Agent Benchmark for Wireless Network Intelligence

📅 2026-03-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses critical limitations in existing benchmarks for wireless network intelligence, which overlook engineering risks and fail to evaluate catastrophic errors such as cascading failures and unit confusion (e.g., dB vs. dBm), while lacking metrics for fault tolerance and tool-use capability. To bridge this gap, we propose the first tolerance-aware, tool-integrated LLM agent benchmark tailored for wireless networks, featuring a three-tier cognitive architecture encompassing domain-knowledge reasoning, intent-driven resource allocation, and proactive multi-step decision-making in mobile scenarios. Our framework enables fine-grained diagnosis of catastrophic errors and reasoning breakdowns through a fault-tolerance scoring mechanism, mandatory tool-invocation tasks, 3GPP-compliant ray-tracing queries, and traceable chain-of-thought annotations. Experiments demonstrate that tool-integrated agents achieve 84.64% accuracy, 16.64 percentage points higher than direct prompting, and reveal that 23% of errors are catastrophic failures invisible to conventional exact-match metrics, enabling actionable error attribution across four distinct categories.

📝 Abstract
LLM agents are emerging as a key enabler for autonomous wireless network management. Reliably deploying them, however, demands benchmarks that reflect real engineering risk. Existing wireless benchmarks evaluate single isolated capabilities and treat all errors uniformly, missing both cascaded-chain failures and catastrophic unit confusions (e.g., dB vs. dBm). We present WirelessBench, the first tolerance-aware, tool-integrated benchmark for LLM-based wireless agents. WirelessBench is organized as a three-tier cognitive hierarchy: domain knowledge reasoning (WCHW, 1,392 items), intent-driven resource allocation (WCNS, 1,000 items), and proactive multi-step decisions under mobility (WCMSA, 1,000 items). Moreover, WirelessBench is built on three design principles: (i) tolerance-aware scoring with catastrophic-error detection; (ii) tool-necessary tasks requiring a 3GPP-compliant ray-tracing query for channel quality; and (iii) Chain-of-Thought (CoT)-traceable items, where every benchmark item ships with a complete CoT trajectory, enabling fine-grained diagnosis of where in the reasoning chain an agent fails. Our numerical results show that the direct-prompting model (GPT-4o) scores 68%, trailing a tool-integrated agent (84.64%) by 16.64 pp; 23% of errors are catastrophic failures invisible to exact-match metrics. More importantly, the hierarchy decomposes errors into four actionable diagnostic categories that flat evaluation cannot reveal. Code and data: https://wirelessbench.github.io/.
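The abstract's first design principle, tolerance-aware scoring with catastrophic-error detection, can be sketched in a few lines. The function name `score_power_answer`, the tolerance thresholds, and the offset table below are illustrative assumptions, not the paper's released code; since a dB-vs-dBm confusion has no single fixed offset, the sketch flags the related fixed-offset case of dBm-vs-dBW confusion (dBm = dBW + 30).

```python
# Hypothetical sketch of tolerance-aware scoring with catastrophic-error
# detection, in the spirit of the benchmark's description. Names, thresholds,
# and the offset table are illustrative assumptions, not the paper's code.

# Known unit-confusion signatures: an error matching one of these offsets
# suggests the agent answered in the wrong unit, not a small numeric slip.
CATASTROPHIC_OFFSETS_DB = {
    "dBm-vs-dBW": 30.0,  # dBm = dBW + 30
}

def score_power_answer(pred_db, truth_db, rel_tol=0.05, abs_tol=0.5):
    """Return (score, label) for a predicted power value on a dB scale.

    score: 1.0 if within tolerance, else 0.0.
    label: 'correct', 'minor', or 'catastrophic:<signature>'.
    """
    err = abs(pred_db - truth_db)
    # Tolerance band: accept small absolute or relative deviations.
    if err <= max(abs_tol, rel_tol * abs(truth_db)):
        return 1.0, "correct"
    # Errors matching a known unit-confusion offset are catastrophic:
    # they indicate a dangerous misunderstanding, not imprecision.
    for name, offset in CATASTROPHIC_OFFSETS_DB.items():
        if abs(err - offset) <= abs_tol:
            return 0.0, f"catastrophic:{name}"
    return 0.0, "minor"
```

For example, a prediction of -60.0 against a ground truth of -30.0 dBm is off by exactly 30 dB and would be labeled `catastrophic:dBm-vs-dBW`, whereas -29.8 falls inside the tolerance band and scores 1.0. Exact-match scoring would treat both wrong answers identically, which is precisely the blind spot the benchmark targets.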
Problem

Research questions and friction points this paper is trying to address.

wireless network intelligence
LLM agent benchmark
catastrophic errors
tolerance-aware evaluation
engineering risk
Innovation

Methods, ideas, or system contributions that make the work stand out.

tolerance-aware benchmark
LLM agent
wireless network intelligence
tool-integrated reasoning
Chain-of-Thought tracing
Jingwen Tong
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Fang Liu
Computer Science and Engineering, Nanjing University of Science and Technology
Deep learning · Image Processing · Remote Sensing · SAR · PolSAR
Linkai Xv
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Shiliang Lu
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Kangqi Li
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Yiqian Zhang
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Yijie Song
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Zeyang Xue
College of Electronics and Information Engineering, Shenzhen University, Shenzhen, China
Jun Zhang
Professor, Hong Kong University of Science and Technology, IEEE Fellow
Mobile Edge Computing · Edge AI · Wireless Communications · GenAI