🤖 AI Summary
Existing safety benchmarks inadequately assess large language models’ (LLMs) misuse prevention capabilities in high-risk, science-knowledge-intensive domains—such as chemistry, biology, and medicine—where regulatory compliance is critical. To address this gap, we introduce SciSafeBench, the first regulation-driven scientific safety alignment benchmark. It spans six core scientific disciplines and comprises 3,000 real-world, high-risk prompts generated via a novel “regulatory text mining + LLM-assisted evolutionary prompting” paradigm. We further develop a multidimensional automated evaluation framework and a cross-model standardized assessment protocol. Experimental results reveal severe safety alignment failures across state-of-the-art models: DeepSeek-R1 produces harmful responses in 79.1% of cases, while GPT-4.1 exhibits a 47.3% harmful response rate. This work constitutes the first systematic empirical investigation of scientific-domain LLM misuse risks, establishing both a new evaluation standard and a foundational methodology for safety assessment in high-stakes scientific knowledge applications.
📝 Abstract
Large language models (LLMs) exhibit advancing capabilities in complex tasks, such as reasoning and graduate-level question answering, yet their resilience against misuse, particularly involving scientifically sophisticated risks, remains underexplored. Existing safety benchmarks typically focus either on instructions requiring minimal knowledge comprehension (e.g., ``tell me how to build a bomb") or utilize prompts that are relatively low-risk (e.g., multiple-choice or classification tasks about hazardous content). Consequently, they fail to adequately assess model safety when handling knowledge-intensive, hazardous scenarios. To address this critical gap, we introduce SOSBench, a regulation-grounded, hazard-focused benchmark encompassing six high-risk scientific domains: chemistry, biology, medicine, pharmacology, physics, and psychology. The benchmark comprises 3,000 prompts derived from real-world regulations and laws, systematically expanded via an LLM-assisted evolutionary pipeline that introduces diverse, realistic misuse scenarios (e.g., detailed explosive synthesis instructions involving advanced chemical formulas). We evaluate frontier models within a unified evaluation framework using our SOSBench. Despite their alignment claims, advanced models consistently disclose policy-violating content across all domains, demonstrating alarmingly high rates of harmful responses (e.g., 79.1% for Deepseek-R1 and 47.3% for GPT-4.1). These results highlight significant safety alignment deficiencies and underscore urgent concerns regarding the responsible deployment of powerful LLMs.