Benchmarking Misuse Mitigation Against Covert Adversaries

📅 2025-06-06

🤖 AI Summary

Problem: Existing language model safety evaluations primarily target explicit, single-turn attacks. They do not address adversaries who decompose a hazardous task into many seemingly benign, independent queries and then combine the answers to reach a malicious objective. Method: The authors formally define this "decomposition-based covert attack" paradigm and propose BSD (Benchmarks for Stateful Defenses), a data generation pipeline that automates the evaluation of covert attacks and of stateful defenses that track a user's queries across interactions. Contribution/Results: BSD is presented as the first benchmark framework supporting state-aware safety assessment. It is used to curate two high-difficulty datasets whose questions leading closed-source models consistently refuse and weaker open-source models largely fail to answer. The authors further propose a cross-model security response consistency metric. Experiments indicate that decomposition-based attacks are effective against mainstream models and that state-preserving defenses are a promising countermeasure to covert misuse, giving safety alignment work a quantifiable, reproducible evaluation infrastructure.

📝 Abstract

Existing language model safety evaluations focus on overt attacks and low-stakes tasks. Realistic attackers can subvert current safeguards by requesting help on small, benign-seeming tasks across many independent queries. Because individual queries do not appear harmful, the attack is hard to detect. However, when combined, these fragments uplift misuse by helping the attacker complete hard and dangerous tasks. Toward identifying defenses against such strategies, we develop Benchmarks for Stateful Defenses (BSD), a data generation pipeline that automates evaluations of covert attacks and corresponding defenses. Using this pipeline, we curate two new datasets that are consistently refused by frontier models and are too difficult for weaker open-weight models. Our evaluations indicate that decomposition attacks are effective misuse enablers, and highlight stateful defenses as a countermeasure.
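
To make the contrast between per-query and stateful screening concrete, here is a minimal sketch in Python. It is not the BSD pipeline or any defense evaluated in the paper. The class names, the thresholds, and the keyword-based `toy_risk_score` function are all hypothetical stand-ins chosen for illustration; a real system would presumably use a trained classifier and a more careful aggregation rule.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# Hypothetical placeholder terms standing in for "fragments" of a larger task.
PLACEHOLDER_TERMS = {"alpha", "beta", "gamma"}


def toy_risk_score(query: str) -> float:
    """Toy per-query risk score in [0, 1] (assumption: stands in for a real classifier)."""
    words = set(query.lower().split())
    return min(1.0, 0.3 * len(words & PLACEHOLDER_TERMS))


class StatelessDefense:
    """Judges each query in isolation, with no memory of earlier queries."""

    def __init__(self, score: Callable[[str], float], threshold: float = 0.5) -> None:
        self.score = score
        self.threshold = threshold

    def allow(self, user_id: str, query: str) -> bool:
        # user_id is ignored: nothing is remembered between calls.
        return self.score(query) < self.threshold


class StatefulDefense:
    """Also tracks cumulative risk per user, so many small fragments can add up."""

    def __init__(
        self,
        score: Callable[[str], float],
        per_query_threshold: float = 0.5,
        cumulative_threshold: float = 0.8,
    ) -> None:
        self.score = score
        self.per_query_threshold = per_query_threshold
        self.cumulative_threshold = cumulative_threshold
        self.cumulative: DefaultDict[str, float] = defaultdict(float)

    def allow(self, user_id: str, query: str) -> bool:
        risk = self.score(query)
        self.cumulative[user_id] += risk
        return (
            risk < self.per_query_threshold
            and self.cumulative[user_id] < self.cumulative_threshold
        )


if __name__ == "__main__":
    queries: List[str] = ["help with alpha", "help with beta", "help with gamma"]
    stateless = StatelessDefense(toy_risk_score)
    stateful = StatefulDefense(toy_risk_score)
    print([stateless.allow("user-1", q) for q in queries])  # [True, True, True]
    print([stateful.allow("user-1", q) for q in queries])   # [True, True, False]
```

In this toy setting each query scores below the per-query threshold, so the stateless check admits all three, while the stateful check refuses the third once the user's accumulated score crosses the cumulative threshold. The simple running sum is only one possible aggregation and would need decay or windowing in practice to avoid penalizing long benign sessions.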

Problem

Research questions and friction points this paper is trying to address.

Detecting covert misuse of language models through fragmented queries

Evaluating defenses against stateful adversarial attacks on AI systems

Benchmarking model resilience to hidden multi-query exploitation strategies

Innovation

Methods, ideas, or system contributions that make the work stand out.

Automated pipeline for covert attack evaluations

Stateful defenses against decomposition attacks

New datasets for testing model safeguards
