🤖 AI Summary
This study addresses the reliability of large language models (LLMs) in performing complex physical reasoning and adhering to safety constraints within safety-critical general aviation scenarios. The authors introduce the first aviation-oriented LLM evaluation benchmark, constructed from 708 real flight trajectories spanning nine flight phases and 34 telemetry channels. They propose Pilot-Score, a composite metric integrating regression accuracy with compliance to instructions and safety protocols. Systematic evaluation of 41 models reveals that while conventional predictors achieve low mean absolute error (MAE ≈ 7.01), they lack semantic understanding; in contrast, LLMs attain instruction-following rates of 86–89% but exhibit higher MAE (11–14) and notably degraded performance during high-dynamic phases such as climb and approach, exposing vulnerabilities in their implicit physical modeling. The work advocates for hybrid architectures combining symbolic reasoning with numerical prediction.
📝 Abstract
As Large Language Models (LLMs) advance toward embodied AI agents operating in physical environments, a fundamental question emerges: can models trained on text corpora reliably reason about complex physics while adhering to safety constraints? We address this through PilotBench, a benchmark evaluating LLMs on safety-critical flight trajectory and attitude prediction. Built from 708 real-world general aviation trajectories spanning nine operationally distinct flight phases with synchronized 34-channel telemetry, PilotBench systematically probes the intersection of semantic understanding and physics-governed prediction through comparative analysis of LLMs and traditional forecasters. We introduce Pilot-Score, a composite metric that weights regression accuracy at 60% and instruction adherence and safety compliance at 40%. Comparative evaluation across 41 models uncovers a Precision-Controllability Dichotomy: traditional forecasters achieve a superior MAE of 7.01 but lack semantic reasoning capabilities, while LLMs gain controllability, with 86–89% instruction-following, at the cost of a higher MAE of 11–14. Phase-stratified analysis further exposes a Dynamic Complexity Gap: LLM performance degrades sharply in high-workload phases such as Climb and Approach, suggesting brittle implicit physics models. These empirical discoveries motivate hybrid architectures combining LLMs' symbolic reasoning with specialized forecasters' numerical precision. PilotBench provides a rigorous foundation for advancing embodied AI in safety-constrained domains.
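The 60/40 weighting described for Pilot-Score can be illustrated with a minimal sketch. The abstract does not specify how MAE is converted into an accuracy term or how adherence and safety compliance are combined, so everything beyond the two weights is an assumption made for illustration: the function names, the linear MAE normalization, and the `mae_ceiling` parameter are hypothetical and are not taken from the paper.

```python
def mean_absolute_error(predicted, actual):
    """Standard MAE over two equal-length sequences of numbers."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def accuracy_from_mae(mae, mae_ceiling):
    """Map an MAE onto [0, 1], where 1 is a perfect prediction.

    ASSUMPTION: a linear normalization against a hypothetical ceiling.
    The paper's actual normalization is not given in the abstract.
    """
    if mae_ceiling <= 0:
        raise ValueError("mae_ceiling must be positive")
    return max(0.0, 1.0 - mae / mae_ceiling)


def composite_score(accuracy, compliance, w_accuracy=0.6, w_compliance=0.4):
    """Weighted sum of an accuracy term and a compliance term, both in [0, 1].

    The 0.6 / 0.4 weights come from the abstract. Treating compliance as a
    single pre-computed rate is an ASSUMPTION of this sketch.
    """
    for name, value in (("accuracy", accuracy), ("compliance", compliance)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")
    return w_accuracy * accuracy + w_compliance * compliance


if __name__ == "__main__":
    # Illustrative inputs only; these are not results from the benchmark.
    mae = mean_absolute_error([101.0, 98.5, 103.0], [100.0, 100.0, 100.0])
    accuracy = accuracy_from_mae(mae, mae_ceiling=20.0)
    print(composite_score(accuracy, compliance=0.88))
```

Under these assumptions, a model with low error but no instruction adherence is capped at 0.6, while a fully compliant model with poor accuracy cannot exceed 0.4 plus its accuracy contribution, which mirrors the trade-off the abstract describes between forecasters and LLMs.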