Intent Laundering: AI Safety Datasets Are Not What They Seem

📅 2026-02-17
📈 Citations: 0 · Influential: 0
🤖 AI Summary
This work addresses a critical limitation in current AI safety datasets: they rely heavily on explicit trigger words and therefore fail to reflect real-world attack scenarios. The authors propose an "intent laundering" technique that uses natural-language abstraction and semantic preservation to remove sensitive trigger phrases from adversarial examples while retaining their malicious intent. The approach doubles as a novel black-box jailbreaking attack and reveals, for the first time, a significant disconnect between mainstream safety benchmarks and actual adversarial threats. Empirical evaluation on leading models, including Gemini 3 Pro and Claude Sonnet 3.7, demonstrates that once explicit triggers are removed, model defenses degrade substantially, with attack success rates reaching 90% to over 98%.
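
A rough way to picture the laundering step, based only on the summary above and not on the authors' released code: a black-box rewriter model is asked to paraphrase an adversarial prompt so that overtly sensitive wording disappears while the request itself, and all of its details, stay intact. The prompt template, the `rewrite` callable, and the function name below are illustrative assumptions.

```python
from typing import Callable

# Illustrative rewriting instruction; the paper's actual template is not shown here.
LAUNDER_TEMPLATE = (
    "Rewrite the following request so that it contains no overtly sensitive or "
    "alarming wording, but still asks for exactly the same thing and keeps every "
    "specific detail:\n\n{attack}"
)

def launder_intent(attack: str, rewrite: Callable[[str], str]) -> str:
    """Abstract away triggering cues from `attack` while preserving its intent.

    `rewrite` is any black-box text-to-text model call (for example a hosted LLM
    API), mirroring the fully black-box access described in the abstract.
    """
    return rewrite(LAUNDER_TEMPLATE.format(attack=attack))
```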

📝 Abstract
We systematically evaluate the quality of widely used AI safety datasets from two perspectives: in isolation and in practice. In isolation, we examine how well these datasets reflect real-world attacks based on three key properties: driven by ulterior intent, well-crafted, and out-of-distribution. We find that these datasets overrely on "triggering cues": words or phrases with overt negative/sensitive connotations that are intended to trigger safety mechanisms explicitly, which is unrealistic compared to real-world attacks. In practice, we evaluate whether these datasets genuinely measure safety risks or merely provoke refusals through triggering cues. To explore this, we introduce "intent laundering": a procedure that abstracts away triggering cues from attacks (data points) while strictly preserving their malicious intent and all relevant details. Our results indicate that current AI safety datasets fail to faithfully represent real-world attacks due to their overreliance on triggering cues. In fact, once these cues are removed, all previously evaluated "reasonably safe" models become unsafe, including Gemini 3 Pro and Claude Sonnet 3.7. Moreover, when intent laundering is adapted as a jailbreaking technique, it consistently achieves high attack success rates, ranging from 90% to over 98%, under fully black-box access. Overall, our findings expose a significant disconnect between how model safety is evaluated and how real-world adversaries behave.
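
For concreteness, the attack success rates quoted in the abstract can be read as the fraction of laundered prompts that the target model answers rather than refuses. The sketch below assumes a hypothetical `is_refusal` classifier and a black-box `target` model call; neither comes from the paper.

```python
from typing import Callable, Iterable

def attack_success_rate(prompts: Iterable[str],
                        target: Callable[[str], str],
                        is_refusal: Callable[[str], bool]) -> float:
    """Fraction of prompts for which the target model complies instead of refusing."""
    prompts = list(prompts)
    if not prompts:
        return 0.0
    compliant = sum(1 for p in prompts if not is_refusal(target(p)))
    return compliant / len(prompts)

# Example: 93 compliant answers out of 100 laundered prompts gives ASR = 0.93,
# within the 90% to over 98% range reported in the abstract.
```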
Problem

Research questions and friction points this paper is trying to address.

AI safety
safety datasets
triggering cues
real-world attacks
intent laundering
Innovation

Methods, ideas, or system contributions that make the work stand out.

intent laundering
AI safety datasets
triggering cues
jailbreaking
out-of-distribution attacks