PilotBench: A Benchmark for General Aviation Agents with Safety Constraints

📅 2026-04-10

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study examines whether large language models (LLMs) can reliably perform complex physical reasoning and adhere to safety constraints in safety-critical general aviation scenarios. The authors introduce PilotBench, the first aviation-oriented LLM evaluation benchmark, constructed from 708 real flight trajectories spanning nine flight phases and 34 telemetry channels. They propose Pilot-Score, a composite metric that weights regression accuracy at 60% and instruction adherence plus safety compliance at 40%. Systematic evaluation of 41 models shows that conventional predictors achieve a lower mean absolute error (MAE ≈ 7.01) but lack semantic understanding, whereas LLMs attain instruction-following rates of 86–89% at a higher MAE (11–14). LLM performance also degrades notably in high-dynamic phases such as climb and approach, exposing weaknesses in their implicit physical modeling. The work advocates hybrid architectures that combine symbolic reasoning with numerical prediction.

📝 Abstract

As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operationally distinct flight phases with synchronized 34-channel telemetry, PilotBench systematically probes the intersection of semantic understanding and physics-governed prediction through comparative analysis of LLMs and traditional forecasters. We introduce Pilot-Score, a composite metric balancing 60% regression accuracy with 40% instruction adherence and safety compliance. Comparative evaluation across 41 models uncovers a Precision-Controllability Dichotomy: traditional forecasters achieve a superior MAE of 7.01 but lack semantic reasoning capabilities, while LLMs gain controllability, with 86–89% instruction-following, at the cost of a higher MAE of 11–14. Phase-stratified analysis further exposes a Dynamic Complexity Gap: LLM performance degrades sharply in high-workload phases such as Climb and Approach, suggesting brittle implicit physics models. These empirical discoveries motivate hybrid architectures combining LLMs' symbolic reasoning with specialized forecasters' numerical precision. PilotBench provides a rigorous foundation for advancing embodied AI in safety-constrained domains.
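The abstract gives only the 60/40 weighting of Pilot-Score, not its formula. The Python sketch below is one possible reading, not the authors' definition. Everything beyond the two weights is an assumption made for illustration: the names `EvalResult` and `pilot_score_sketch`, the linear mapping from MAE to an accuracy score via a hypothetical `mae_ceiling`, and the equal split of the 40% between instruction adherence and safety compliance.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical per-model evaluation summary (field names are assumptions)."""
    mae: float               # mean absolute error of the predictions
    instruction_rate: float  # fraction of responses that followed instructions, in [0, 1]
    safety_rate: float       # fraction of responses within safety limits, in [0, 1]


def pilot_score_sketch(result: EvalResult, mae_ceiling: float = 20.0) -> float:
    """Illustrative composite: 60% regression accuracy, 40% adherence and compliance."""
    # Assumption: map MAE linearly onto [0, 1]; zero error scores 1,
    # and any MAE at or above mae_ceiling scores 0.
    accuracy = max(0.0, 1.0 - result.mae / mae_ceiling)
    # Assumption: instruction adherence and safety compliance share the 40% equally.
    compliance = 0.5 * result.instruction_rate + 0.5 * result.safety_rate
    # The 0.6 / 0.4 weights are the only values taken from the abstract.
    return 0.6 * accuracy + 0.4 * compliance


if __name__ == "__main__":
    # Illustrative inputs only: the MAE and instruction rate fall in the ranges
    # the abstract reports for LLMs, while the safety rate is a placeholder.
    llm_like = EvalResult(mae=12.0, instruction_rate=0.88, safety_rate=0.90)
    print(f"{pilot_score_sketch(llm_like):.3f}")
```

Under any weighting of this kind, a model with low MAE but poor instruction adherence and a model with high adherence but larger MAE can land on similar composite scores, which is the trade-off the abstract calls the Precision-Controllability Dichotomy.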

Problem

Research questions and friction points this paper is trying to address.

embodied AI

safety constraints

flight trajectory prediction

physics reasoning

general aviation

Innovation

Methods, ideas, or system contributions that make the work stand out.

PilotBench

safety-constrained embodied AI

Precision-Controllability Dichotomy