AI Summary
Current air traffic control (ATC) language understanding systems are evaluated without distinguishing high-risk semantic errors from benign ones, which compromises reliability in safety-critical scenarios. This work proposes the first consequence-aware evaluation framework tailored to ATC. It moves beyond conventional uniform error metrics by introducing a risk-scoring mechanism that quantifies entity recognition errors according to their operational consequences. The framework integrates operational impact modeling and validates performance using macro-F1 scores. Experimental results on clean transcripts show that, despite high macro-F1, most state-of-the-art large language models score below 0.6 on the Risk Score and the best reaches only 0.69. This exposes severe reliability deficiencies in interpreting critical instructions and highlights structural shortcomings in high-stakes semantic understanding.
Abstract
Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their reliability in operational ATC environments remains unclear. Existing evaluation approaches, largely based on aggregate metrics such as F1 or macro accuracy, treat all errors uniformly and fail to account for the asymmetric consequences of high-risk semantic mistakes (e.g., incorrect runway identifiers or movement constraints). To address this gap, we propose a safety-oriented, consequence-aware evaluation framework tailored to ATC operations. Our results reveal that while current LLMs achieve reasonable aggregate accuracy, their operational reliability is severely limited. Evaluated on clean transcripts, the peak Risk Score reaches only 0.69, with most models scoring below 0.6 despite high macro-F1 performance. Further analysis shows that errors concentrate in high-impact entities despite relatively stable action-type classification, indicating structural grounding deficiencies. These findings highlight the necessity of consequence-aware evaluation protocols for the responsible deployment of AI-assisted ATC systems.
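To make the contrast between uniform and consequence-aware scoring concrete, here is a minimal sketch. The abstract does not define the Risk Score, so the formula, the entity types other than runway and movement constraint, and all weights below are hypothetical choices made for illustration only; they are not the paper's method or data.

```python
# Hypothetical illustration: the weights and the scoring formula are
# assumptions for this sketch, not the paper's Risk Score definition.

# Assumed consequence weights per entity type (higher = more severe if wrong).
WEIGHTS = {"runway": 5, "movement_constraint": 4, "callsign": 2, "greeting": 1}


def uniform_accuracy(results):
    """Fraction of entities predicted correctly, every entity counted equally."""
    return sum(correct for _, correct in results) / len(results)


def weighted_score(results, weights=WEIGHTS):
    """Share of total consequence weight carried by correctly predicted entities."""
    total = sum(weights[etype] for etype, _ in results)
    kept = sum(weights[etype] for etype, correct in results if correct)
    return kept / total


# Toy predictions: (entity type, was the prediction correct?)
results = [
    ("greeting", True),
    ("greeting", True),
    ("callsign", True),
    ("callsign", True),
    ("movement_constraint", True),
    ("runway", False),  # a single error, but on the highest-weight entity
]

print(round(uniform_accuracy(results), 3))  # 5 of 6 correct
print(round(weighted_score(results), 3))    # 10 of 15 weight units kept
```

In this toy case one wrong runway identifier costs a sixth of the uniform score but a third of the weighted one, which is the kind of gap between aggregate accuracy and consequence-weighted reliability that the abstract describes.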