FORTRESS: Frontier Risk Evaluation for National Security and Public Safety

📅 2025-06-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing LLM safety evaluation benchmarks lack systematic assessment of robustness against National Security and Public Safety (NSPS) risks. To address this gap, this paper introduces the first NSPS-oriented risk assessment benchmark for frontier large language models. It covers ten subcategories across three critical domains—CBRNE (Chemical, Biological, Radiological, Nuclear, Explosive), political violence/terrorism, and criminal/financial illicit activities—comprising 500 expert-crafted adversarial prompts and structured binary scoring criteria. We propose a novel instance-level automated evaluation framework that jointly quantifies Attack Risk Score (ARS) and Over-Rejection Score (ORS). Additionally, we publicly release the first NSPS-specific dataset with benign对照 prompts and provide two distinct evaluation splits: public and private. Experimental results reveal pronounced safety trade-offs among leading models: DeepSeek-R1 exhibits the highest risk (ARS = 78.05) but minimal over-rejection (ORS = 0.06); Claude-3.5-Sonnet is most conservative (ORS = 21.8); and Gemini 2.5 Pro demonstrates elevated risk (ARS = 66.29).

Technology Category

Application Category

📝 Abstract
The rapid advancement of large language models (LLMs) introduces dual-use capabilities that could both threaten and bolster national security and public safety (NSPS). Models implement safeguards to protect against potential misuse relevant to NSPS and allow for benign users to receive helpful information. However, current benchmarks often fail to test safeguard robustness to potential NSPS risks in an objective, robust way. We introduce FORTRESS: 500 expert-crafted adversarial prompts with instance-based rubrics of 4-7 binary questions for automated evaluation across 3 domains (unclassified information only): Chemical, Biological, Radiological, Nuclear and Explosive (CBRNE), Political Violence&Terrorism, and Criminal&Financial Illicit Activities, with 10 total subcategories across these domains. Each prompt-rubric pair has a corresponding benign version to test for model over-refusals. This evaluation of frontier LLMs' safeguard robustness reveals varying trade-offs between potential risks and model usefulness: Claude-3.5-Sonnet demonstrates a low average risk score (ARS) (14.09 out of 100) but the highest over-refusal score (ORS) (21.8 out of 100), while Gemini 2.5 Pro shows low over-refusal (1.4) but a high average potential risk (66.29). Deepseek-R1 has the highest ARS at 78.05, but the lowest ORS at only 0.06. Models such as o1 display a more even trade-off between potential risks and over-refusals (with an ARS of 21.69 and ORS of 5.2). To provide policymakers and researchers with a clear understanding of models' potential risks, we publicly release FORTRESS at https://huggingface.co/datasets/ScaleAI/fortress_public. We also maintain a private set for evaluation.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLM safeguard robustness for national security risks
Testing model over-refusals versus potential misuse in NSPS
Providing objective benchmarks for dual-use LLM capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Expert-crafted adversarial prompts for evaluation
Automated rubrics with binary questions
Public dataset for model risk assessment
🔎 Similar Papers
No similar papers found.
C
Christina Q. Knight
Scale AI
Kaustubh Deshpande
Kaustubh Deshpande
Scale AI
V
Ved Sirdeshmukh
Scale AI
M
Meher Mankikar
Scale AI
S
Scale Red Team
Scale AI
S
Seal Research Team
Scale AI
Julian Michael
Julian Michael
Scale AI
AI AlignmentComputational LinguisticsNatural Language ProcessingFormal Semantics