Safety-Oriented Evaluation of Language Understanding Systems for Air Traffic Control

Date: 2026-05-12

AI Summary
Current air traffic control (ATC) language understanding systems are evaluated without distinguishing high-risk semantic errors from benign ones, which obscures their reliability in safety-critical scenarios. This work proposes a consequence-aware evaluation framework tailored to ATC. Instead of conventional uniform error metrics, it introduces a risk-scoring mechanism that quantifies entity recognition errors by their operational consequences. Experimental results on clean transcripts show that, despite high macro-F1 performance, most state-of-the-art large language models score below 0.6 on the Risk Score, with a peak of only 0.69. Errors concentrate in high-impact entities, which exposes reliability deficiencies in interpreting critical instructions and points to structural shortcomings in high-stakes semantic understanding.
๐Ÿ“ Abstract
Air Traffic Control (ATC) is a safety-critical domain in which incorrect interpretation of instructions may lead to severe operational consequences. While large language models (LLMs) demonstrate strong general performance, their reliability in operational ATC environments remains unclear. Existing evaluation approaches, largely based on aggregate metrics such as F1 or macro accuracy, treat all errors uniformly and fail to account for the asymmetric consequences of high-risk semantic mistakes (e.g., incorrect runway identifiers or movement constraints). To address this gap, we propose a safety-oriented, consequence-aware evaluation framework tailored to ATC operations. Our results reveal that while current LLMs achieve reasonable aggregate accuracy, their operational reliability is severely limited. Evaluated on clean transcripts, the peak Risk Score reaches only 0.69, with most models scoring below 0.6 despite high macro-F1 performance. Further analysis shows that errors concentrate in high-impact entities despite relatively stable action-type classification, indicating structural grounding deficiencies. These findings highlight the necessity of consequence-aware evaluation protocols for the responsible deployment of AI-assisted ATC systems.
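The contrast between uniform and consequence-aware scoring can be illustrated with a small sketch. The abstract does not give the paper's Risk Score definition, so everything below is an assumption made for illustration: the entity types, the severity weights, and the weighted-ratio formula are hypothetical and are not the authors' method.

```python
from typing import Dict

# Hypothetical severity weights per entity type (assumed, not from the paper).
SEVERITY: Dict[str, float] = {
    "runway": 5.0,
    "movement_constraint": 4.0,
    "callsign": 3.0,
    "action_type": 1.0,
}


def uniform_accuracy(gold: Dict[str, str], pred: Dict[str, str]) -> float:
    """Fraction of gold entities predicted exactly; every error counts the same."""
    correct = sum(1 for key, value in gold.items() if pred.get(key) == value)
    return correct / len(gold)


def weighted_score(
    gold: Dict[str, str],
    pred: Dict[str, str],
    severity: Dict[str, float] = SEVERITY,
) -> float:
    """Severity-weighted fraction of gold entities predicted exactly.

    An illustrative stand-in for a consequence-aware score: a mistake on a
    high-severity entity removes more of the score than a low-severity one.
    """
    total = sum(severity[key] for key in gold)
    kept = sum(severity[key] for key, value in gold.items() if pred.get(key) == value)
    return kept / total


# Made-up instruction: "DLH123, hold short of runway 27L".
gold = {
    "callsign": "DLH123",
    "action_type": "hold_short",
    "runway": "27L",
    "movement_constraint": "hold short",
}
# The prediction gets everything right except the runway identifier.
pred = {**gold, "runway": "27R"}

uniform = uniform_accuracy(gold, pred)
weighted = weighted_score(gold, pred)
print(f"uniform accuracy: {uniform:.3f}")
print(f"weighted score:   {weighted:.3f}")
```

Under these assumed weights, a single runway error costs one quarter of the uniform accuracy but more than a third of the weighted score, which is the kind of asymmetry the abstract argues aggregate metrics hide.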
Problem

Research questions and friction points this paper is trying to address.

Air Traffic Control
safety-critical
language understanding
risk-aware evaluation
semantic errors
Innovation

Methods, ideas, or system contributions that make the work stand out.

safety-oriented evaluation
consequence-aware framework
risk score
air traffic control
language understanding systems