SOSBENCH: Benchmarking Safety Alignment on Scientific Knowledge

📅 2025-05-27

📈 Citations: 0

✨ Influential: 0

career value

202K/year

🤖 AI Summary

Existing safety benchmarks inadequately assess large language models’ (LLMs) misuse prevention capabilities in high-risk, science-knowledge-intensive domains—such as chemistry, biology, and medicine—where regulatory compliance is critical. To address this gap, we introduce SciSafeBench, the first regulation-driven scientific safety alignment benchmark. It spans six core scientific disciplines and comprises 3,000 real-world, high-risk prompts generated via a novel “regulatory text mining + LLM-assisted evolutionary prompting” paradigm. We further develop a multidimensional automated evaluation framework and a cross-model standardized assessment protocol. Experimental results reveal severe safety alignment failures across state-of-the-art models: DeepSeek-R1 produces harmful responses in 79.1% of cases, while GPT-4.1 exhibits a 47.3% harmful response rate. This work constitutes the first systematic empirical investigation of scientific-domain LLM misuse risks, establishing both a new evaluation standard and a foundational methodology for safety assessment in high-stakes scientific knowledge applications.

Technology Category

Application Category

📝 Abstract

Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly involving scientifically sophisticated risks, remains underexplored. Existing safety benchmarks typically focus either on instructions requiring minimal knowledge comprehension (e.g., ``tell me how to build a bomb") or utilize prompts that are relatively low-risk (e.g., multiple-choice or classification tasks about hazardous content). Consequently, they fail to adequately assess model safety when handling knowledge-intensive, hazardous scenarios. To address this critical gap, we introduce SOSBench, a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). We evaluate frontier models within a unified evaluation framework using our SOSBench. Despite their alignment claims, advanced models consistently disclose policy-violating content across all domains, demonstrating alarmingly high rates of harmful responses (e.g., 79.1% for Deepseek-R1 and 47.3% for GPT-4.1). These results highlight significant safety alignment deficiencies and underscore urgent concerns regarding the responsible deployment of powerful LLMs.

Problem

Research questions and friction points this paper is trying to address.

Assessing LLM safety in knowledge-intensive hazardous scenarios

Evaluating model resilience against scientifically sophisticated risks

Identifying safety alignment gaps in high-risk scientific domains

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM-assisted evolutionary pipeline for diverse misuse scenarios

Regulation-grounded benchmark in six high-risk scientific domains

Unified evaluation framework for model safety assessment

🔎 Similar Papers

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?