AI Summary
This study investigates whether the safety mechanisms of current large language models remain robust against harmful instructions disguised through humanistic stylistic rewrites, such as poetry or fables. To this end, the authors construct the first systematic adversarial benchmark by applying semantics-preserving humanistic style transformations to the MLCommons AILuminate harmful task set, thereby crafting attacks that preserve malicious intent while obscuring surface expression. Evaluations are aligned with the high-risk categories defined in the EU AI Act. Experimental results reveal a stark vulnerability: while the original attack success rate is merely 3.84%, it surges to 55.75% (range: 36.8%–65.0%) after stylistic perturbation, with CBRN-related tasks exhibiting the highest risk. These findings expose a critical weakness in existing safety systems regarding generalization across stylistic variations.
Abstract
The Adversarial Humanities Benchmark (AHB) evaluates whether model safety refusals survive a shift away from familiar harmful prompt forms. Starting from harmful tasks drawn from MLCommons AILuminate, the benchmark rewrites the same objectives through humanities-style transformations while preserving intent. This extends the literature on Adversarial Poetry and Adversarial Tales from single jailbreak operators to a broader benchmark family of stylistic obfuscation and goal concealment. In the benchmark results reported here, the original attacks record a 3.84% attack success rate (ASR), while the transformed methods range from 36.8% to 65.0%, yielding a 55.75% overall ASR across 31 frontier models. Under a European Union AI Act Code-of-Practice-inspired systemic-risk lens, the chemical, biological, radiological and nuclear (CBRN) category is the highest-risk bucket. Taken together, this lack of stylistic robustness suggests that current safety techniques suffer from weak generalization: a deep understanding of 'non-maleficence' remains a central unresolved problem in frontier model safety.
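To make the ASR figures above concrete, here is a minimal sketch of how per-method and overall attack success rates could be computed from judged responses. It assumes ASR is simply the fraction of attempts judged successful; the function name, the record layout, and the toy data are hypothetical and are not taken from the benchmark itself.

```python
from collections import defaultdict


def attack_success_rates(records):
    """Compute per-method and pooled ASR.

    `records` is an iterable of (method, success) pairs, where `success`
    is True when a response was judged harmful. This layout is an
    assumption made for illustration only.
    """
    counts = defaultdict(lambda: [0, 0])  # method -> [successes, attempts]
    for method, success in records:
        counts[method][0] += int(success)
        counts[method][1] += 1

    per_method = {m: s / n for m, (s, n) in counts.items()}

    total_success = sum(s for s, _ in counts.values())
    total_attempts = sum(n for _, n in counts.values())
    # Pooled ASR weights each method by its number of attempts.
    overall = total_success / total_attempts if total_attempts else 0.0
    return per_method, overall


# Toy, made-up judgments (not benchmark data).
toy = [
    ("poetry", True), ("poetry", False),
    ("fable", True), ("fable", True), ("fable", True), ("fable", False),
]
per_method, overall = attack_success_rates(toy)
print(per_method, overall)
```

Because the pooled rate weights each method by its attempt count, it need not equal the plain average of the per-method rates. That is one plausible reason an overall figure can sit away from the midpoint of a reported range.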