🤖 AI Summary
This study addresses a gap in the safety evaluation of large language models (LLMs): existing evaluations focus predominantly on civilian risks and neglect military-specific legal and ethical norms, such as the laws of armed conflict, rules of engagement, and joint ethics regulations, that are essential for assessing compliance in defense-related decision-making. To bridge this gap, the authors introduce the first structured safety alignment benchmark tailored for military applications. Organized around the OODA (Observe–Orient–Decide–Act) decision framework, the benchmark comprises 12 problem categories and 519 multiple-choice questions generated through doctrinal text extraction and semantics-preserving question synthesis. A systematic evaluation of 21 leading commercial LLMs reveals significant deficiencies in military safety alignment, offering guidance for improving the compliance of AI systems deployed in defense contexts.
📝 Abstract
Large language models (LLMs) are now being explored for defense applications that require reliable and legally compliant decision support. They also hold significant potential to enhance decision-making, coordination, and operational efficiency in military contexts. These uses demand evaluation methods that reflect the doctrinal standards guiding real military operations. Existing safety benchmarks focus on general social risks and do not test whether models follow the legal and ethical rules that govern such operations. To address this gap, we introduce ARMOR 2025, a military-aligned safety benchmark grounded in three core military doctrines: the Law of War, the Rules of Engagement, and the Joint Ethics Regulation. We extract doctrinal text from these sources and generate multiple-choice questions that preserve the intended meaning of each rule. The benchmark is organized through a taxonomy informed by the Observe–Orient–Decide–Act (OODA) decision-making framework. This structure enables systematic testing of accuracy and refusal across military-relevant decision types. The benchmark features a structured 12-category taxonomy, 519 doctrinally grounded prompts, and rigorous evaluation procedures applied to 21 commercial LLMs. Evaluation results reveal critical gaps in safety alignment for military applications.
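To make the idea of testing accuracy and refusal per decision type concrete, the following is a minimal sketch of how such scores might be tallied by category. It is not the authors' evaluation code: the `Response` record, the `score_by_category` function, the category labels, and the convention that a refusal is recorded as `None` and counted as not correct are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    """One model answer to one multiple-choice question (hypothetical schema)."""
    category: str                # placeholder taxonomy label, not from the paper
    correct_choice: str          # answer key for the question
    model_choice: Optional[str]  # None is assumed to mean the model refused


def score_by_category(responses):
    """Return {category: (accuracy, refusal_rate)}.

    Assumption: both rates are computed over all questions in the category,
    so a refusal lowers accuracy as well as raising the refusal rate.
    """
    totals = defaultdict(lambda: {"n": 0, "correct": 0, "refused": 0})
    for r in responses:
        t = totals[r.category]
        t["n"] += 1
        if r.model_choice is None:
            t["refused"] += 1
        elif r.model_choice == r.correct_choice:
            t["correct"] += 1
    return {
        cat: (t["correct"] / t["n"], t["refused"] / t["n"])
        for cat, t in totals.items()
    }


if __name__ == "__main__":
    # Made-up sample data purely to exercise the function.
    sample = [
        Response("Decide", "A", "A"),
        Response("Decide", "B", None),
        Response("Act", "C", "D"),
        Response("Act", "C", "C"),
    ]
    for category, (accuracy, refusal_rate) in score_by_category(sample).items():
        print(f"{category}: accuracy={accuracy:.2f}, refusal={refusal_rate:.2f}")
```

Reporting the two rates side by side separates a model that answers incorrectly from one that declines to answer. Whether the benchmark scores refusals this way is not stated in the abstract.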