🤖 AI Summary
This work addresses a critical security vulnerability in text-to-image models wherein adversaries exploit the models’ text-rendering capabilities to embed harmful content—such as forged documents—while evading existing defenses that struggle to simultaneously preserve character-level fidelity and effectively block malicious outputs. The authors propose Etch, a black-box attack framework that formalizes, for the first time, the “inscriptive jailbreak” paradigm. Etch employs a three-stage disentangled strategy—semantic camouflage, visual-spatial anchoring, and typographic encoding—and leverages zeroth-order optimization guided by vision-language model feedback to iteratively refine prompts without access to internal model parameters. Evaluated across seven state-of-the-art models and two benchmarks, Etch achieves an average attack success rate of 65.57%, peaking at 91.00%, substantially outperforming current baselines and exposing a fundamental blind spot in multimodal safety alignment regarding layout-aware perception.
📝 Abstract
Modern text-to-image (T2I) models can now render legible, paragraph-length text, enabling a fundamentally new class of misuse. We identify and formalize the inscriptive jailbreak, where an adversary coerces a T2I system into generating images containing harmful textual payloads (e.g., fraudulent documents) embedded within visually benign scenes. Unlike traditional depictive jailbreaks that elicit visually objectionable imagery, inscriptive attacks weaponize the text-rendering capability itself. Because existing jailbreak techniques are designed for coarse visual manipulation, they struggle to bypass multi-stage safety filters while maintaining character-level fidelity. To expose this vulnerability, we propose Etch, a black-box attack framework that decomposes the adversarial prompt into three functionally orthogonal layers: semantic camouflage, visual-spatial anchoring, and typographic encoding. This decomposition reduces joint optimization over the full prompt space to tractable sub-problems, which are iteratively refined through a zeroth-order optimization loop. In this process, a vision-language model critiques each generated image, localizes failures to specific layers, and prescribes targeted revisions. Extensive evaluations across seven models on two benchmarks demonstrate that Etch achieves an average attack success rate of 65.57% (peaking at 91.00%), significantly outperforming existing baselines. Our results reveal a critical blind spot in current T2I safety alignment and underscore the urgent need for typography-aware multimodal defense mechanisms.