α³-Bench: A Unified Benchmark of Safety, Robustness, and Efficiency for LLM-Based UAV Agents over 6G Networks

📅 2026-01-01
🏛️ arXiv.org
📈 Citations: 2
Influential: 0
🤖 AI Summary
Current evaluation frameworks struggle to comprehensively assess the safety, protocol compliance, and task effectiveness of large language model (LLM)-driven drone agents under the dynamic constraints of 6G networks. To address this gap, this work proposes α³-Bench, a novel benchmark that integrates safety, robustness, and efficiency into a unified α³ evaluation metric. Built upon a multi-turn conversational control framework, α³-Bench features a dual-action-layer architecture enabling tool invocation and multi-agent collaboration, while incorporating 6G network emulation—including latency, jitter, and packet loss—and tool consistency verification. Evaluation across 17 prominent LLMs on a dataset of 113k dialogues reveals that although most models achieve high task success rates under nominal conditions, their robustness and communication efficiency degrade significantly under impaired 6G network conditions.
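The 6G network emulation described above (latency, jitter, and packet loss) can be illustrated with a minimal sketch. The benchmark's actual emulator API is not specified in this summary, so the `NetworkEmulator` class, its parameter names, and the impairment model below are illustrative assumptions, not the paper's implementation.

```python
import random

class NetworkEmulator:
    """Illustrative 6G-style link impairment model: fixed base latency,
    uniform jitter, and Bernoulli packet loss. Hypothetical sketch only;
    not the benchmark's actual emulator."""

    def __init__(self, base_latency_ms=20.0, jitter_ms=5.0,
                 loss_rate=0.01, seed=None):
        self.base_latency_ms = base_latency_ms
        self.jitter_ms = jitter_ms
        self.loss_rate = loss_rate
        self.rng = random.Random(seed)  # seeded for reproducible runs

    def transmit(self, message):
        """Return (delivered, delay_ms) for one agent/operator message."""
        if self.rng.random() < self.loss_rate:
            return False, None  # packet dropped
        delay = self.base_latency_ms + self.rng.uniform(-self.jitter_ms,
                                                        self.jitter_ms)
        return True, max(delay, 0.0)

# Degraded-link condition: 30 ms base latency, +/-10 ms jitter, 5% loss.
emu = NetworkEmulator(base_latency_ms=30.0, jitter_ms=10.0,
                      loss_rate=0.05, seed=42)
results = [emu.transmit(f"cmd-{i}") for i in range(1000)]
delivered = [d for ok, d in results if ok]
loss_observed = 1.0 - len(delivered) / len(results)
```

A harness like this lets the same dialogue episode be replayed under nominal and impaired conditions, which is the comparison the evaluation makes when it reports robustness degradation.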

📝 Abstract
Large Language Models (LLMs) are increasingly used as high-level controllers for autonomous Unmanned Aerial Vehicle (UAV) missions. However, existing evaluations rarely assess whether such agents remain safe, protocol-compliant, and effective under realistic next-generation networking constraints. This paper introduces $\alpha^3$-Bench, a benchmark for evaluating LLM-driven UAV autonomy as a multi-turn conversational reasoning and control problem operating under dynamic 6G conditions. Each mission is formulated as a language-mediated control loop between an LLM-based UAV agent and a human operator, where decisions must satisfy strict schema validity, mission policies, speaker alternation, and safety constraints while adapting to fluctuating network slices, latency, jitter, packet loss, throughput, and edge load variations. To reflect modern agentic workflows, $\alpha^3$-Bench integrates a dual action layer supporting both tool calls and agent-to-agent coordination, enabling evaluation of tool-use consistency and multi-agent interactions. We construct a large-scale corpus of 113k conversational UAV episodes grounded in UAVBench scenarios and evaluate 17 state-of-the-art LLMs using a fixed subset of 50 episodes per scenario under deterministic decoding. We propose a composite $\alpha^3$ metric that unifies six pillars: Task Outcome, Safety Policy, Tool Consistency, Interaction Quality, Network Robustness, and Communication Cost, with efficiency-normalized scores per second and per thousand tokens. Results show that while several models achieve high mission success and safety compliance, robustness and efficiency vary significantly under degraded 6G conditions, highlighting the need for network-aware and resource-efficient LLM-based UAV agents. The dataset is publicly available on GitHub: https://github.com/maferrag/AlphaBench
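The composite $\alpha^3$ metric unifies six pillar scores, with efficiency-normalized views per second and per thousand tokens. The abstract does not give the aggregation formula, so the equal-weight mean and the normalization helpers below are assumptions for illustration only; the pillar names are taken from the abstract.

```python
from dataclasses import dataclass

@dataclass
class PillarScores:
    """The six pillars named in the abstract, each assumed to lie in
    [0, 1]. The aggregation rule below is an illustrative assumption,
    not the paper's actual formula."""
    task_outcome: float
    safety_policy: float
    tool_consistency: float
    interaction_quality: float
    network_robustness: float
    communication_cost: float  # assumed already normalized so higher is better

def alpha3_score(s: PillarScores) -> float:
    """Equal-weight arithmetic mean over the six pillars (assumed)."""
    vals = [s.task_outcome, s.safety_policy, s.tool_consistency,
            s.interaction_quality, s.network_robustness, s.communication_cost]
    if not all(0.0 <= v <= 1.0 for v in vals):
        raise ValueError("pillar scores must lie in [0, 1]")
    return sum(vals) / len(vals)

def normalize_efficiency(raw_score: float, wall_seconds: float,
                         tokens: int) -> dict:
    """Efficiency-normalized views mentioned in the abstract:
    score per second and per thousand tokens (illustrative)."""
    return {
        "per_second": raw_score / wall_seconds,
        "per_k_tokens": raw_score / (tokens / 1000.0),
    }
```

Normalizing by wall-clock time and token count is what lets the metric penalize a model that succeeds only by spending many slow, verbose turns, which is the trade-off the abstract highlights under degraded network conditions.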
Problem

Research questions and friction points this paper is trying to address.

LLM-based UAV agents
6G networks
safety
robustness
efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-based UAV agents
6G network constraints
multi-agent coordination
tool-use consistency
composite benchmark metric