🤖 AI Summary
This work exposes critical security vulnerabilities in the dataset ownership verification (DOV) mechanisms used to protect datasets during personalized fine-tuning of text-to-image (T2I) diffusion models. To demonstrate that existing DOV schemes are susceptible to evasion, we propose CEAT2I, the first copyright evasion attack for T2I models. CEAT2I identifies watermarked samples via intermediate feature deviation analysis and pinpoints the associated trigger tokens; it then applies a closed-form concept erasure strategy, tailored to diffusion models, to nullify the watermark-related semantics without retraining the model or using external data. Evaluated on mainstream T2I models including Stable Diffusion, CEAT2I achieves over a 95% DOV evasion success rate with negligible impact on generation quality (FID increase below 1.2). This study provides the first systematic failure-mode analysis of DOV in T2I fine-tuning, delivering crucial counterexamples and actionable insights for designing more robust copyright protection mechanisms.
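
To make the detection stage concrete, below is a minimal sketch of how intermediate feature deviation analysis might look on a diffusers-style Stable Diffusion pipeline. Everything here is illustrative: the hook target (`mid_block`), the spatial pooling, and the MAD-based outlier rule are assumptions, not details confirmed by the paper.

```python
# Illustrative sketch of stage 1: flag training samples whose intermediate
# UNet features drift away from the clean-sample cluster during fine-tuning.
# Assumes a diffusers-style Stable Diffusion UNet; the hook target
# (`mid_block`), spatial pooling, and MAD threshold are hypothetical choices.
import torch

@torch.no_grad()
def midblock_features(unet, latents, timesteps, text_emb):
    """Capture the UNet mid-block activation for a batch of samples."""
    feats = {}
    handle = unet.mid_block.register_forward_hook(
        lambda module, inputs, output: feats.update(mid=output)
    )
    unet(latents, timesteps, encoder_hidden_states=text_emb)
    handle.remove()
    # Average over spatial dimensions -> one feature vector per sample.
    return feats["mid"].float().mean(dim=(2, 3))

def flag_suspects(features, z_thresh=3.0):
    """Return indices of likely watermarked samples.

    `features` is an (N, D) tensor with one row per training sample,
    collected at a fixed fine-tuning step. Because watermarked samples
    converge faster, their features deviate from the majority.
    """
    center = features.median(dim=0).values
    dists = (features - center).norm(dim=1)
    med = dists.median()
    mad = (dists - med).abs().median() + 1e-8   # robust scale estimate
    z = (dists - med) / (1.4826 * mad)          # modified z-score
    return (z > z_thresh).nonzero(as_tuple=True)[0]
```

In practice the deviation would likely be tracked across several fine-tuning steps rather than at a single snapshot, since the faster-convergence signal is a property of the training trajectory.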
📝 Abstract
Text-to-image (T2I) diffusion models have advanced rapidly, enabling high-quality image generation conditioned on textual prompts. However, the growing practice of fine-tuning pre-trained models for personalization raises serious concerns about unauthorized dataset usage. To combat this, dataset ownership verification (DOV) has emerged as a solution, embedding watermarks into fine-tuning datasets via backdoor techniques. These watermarks remain inactive on benign samples but produce owner-specified outputs when triggered. Despite the promise of DOV for T2I diffusion models, its robustness against copyright evasion attacks (CEA) remains unexplored. In this paper, we explore how attackers can bypass these mechanisms through CEA, allowing fine-tuned models to evade watermark verification even when trained on watermarked datasets. We propose the first copyright evasion attack (i.e., CEAT2I) specifically designed to undermine DOV in T2I diffusion models. Concretely, CEAT2I comprises three stages: watermarked sample detection, trigger identification, and efficient watermark mitigation. A key insight driving our approach is that T2I models converge faster on watermarked samples during fine-tuning, which manifests as deviations in intermediate features. Leveraging this, CEAT2I reliably detects watermarked samples. We then iteratively ablate tokens from the prompts of detected watermarked samples and monitor the resulting shifts in intermediate features to pinpoint the exact trigger tokens. Finally, we adopt a closed-form concept erasure method to remove the injected watermark. Extensive experiments show that CEAT2I effectively evades DOV mechanisms while preserving model performance.
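
The trigger-identification stage can be sketched the same way, reusing the hypothetical `midblock_features` helper above: remove one prompt token at a time and measure how far the intermediate features move. The whitespace tokenization and the mean-plus-2σ cutoff are illustrative choices, not the paper's exact procedure.

```python
# Illustrative sketch of stage 2: ablate one prompt token at a time and
# measure how far the UNet's intermediate features shift. Tokens whose
# removal causes an outsized shift are likely watermark triggers.
import torch

@torch.no_grad()
def find_trigger_tokens(pipe, prompt, latents, timesteps):
    """`latents`/`timesteps` describe a single noised sample (batch size 1)."""
    def encode(p):
        ids = pipe.tokenizer(
            p, padding="max_length",
            max_length=pipe.tokenizer.model_max_length,
            truncation=True, return_tensors="pt",
        ).input_ids.to(pipe.device)
        return pipe.text_encoder(ids)[0]

    tokens = prompt.split()  # whitespace split for simplicity
    base = midblock_features(pipe.unet, latents, timesteps, encode(prompt))

    shifts = []
    for i in range(len(tokens)):
        ablated = " ".join(tokens[:i] + tokens[i + 1:])
        feats = midblock_features(pipe.unet, latents, timesteps, encode(ablated))
        shifts.append((base - feats).norm().item())

    shifts = torch.tensor(shifts)
    cutoff = shifts.mean() + 2 * shifts.std()   # simple outlier rule
    return [tokens[i] for i in (shifts > cutoff).nonzero(as_tuple=True)[0].tolist()]
```

For the final stage, the abstract specifies only "a closed-form concept erasure method"; one plausible instantiation is a UCE-style closed-form edit of the cross-attention key/value projections that remaps the identified trigger tokens to a benign anchor concept, which would avoid any retraining.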