PolyJuice Makes It Real: Black-Box, Universal Red Teaming for Synthetic Image Detectors

📅 2025-09-18
🤖 AI Summary
Existing red-teaming methods for synthetic image detectors (SIDs) require white-box access to the detector and per-image online optimization, which makes them infeasible against proprietary state-of-the-art detectors and computationally expensive. This paper introduces PolyJuice, the first black-box, image-agnostic red-teaming method for SIDs. Using only black-box queries to the SID, PolyJuice identifies the direction of a distribution shift in the latent space of a text-to-image (T2I) model between samples the SID classifies correctly and samples it misclassifies, then steers all generated images along that direction. The direction can be estimated at low resolution and transferred to higher resolutions by simple interpolation, which reduces computational overhead. PolyJuice-steered T2I models are more effective at deceiving SIDs (up to 84%) than their unsteered counterparts, and tuning SIDs on PolyJuice-augmented datasets improves detector performance by up to 30%. Together, these results make red teaming of SIDs practical under realistic black-box conditions.

📝 Abstract
Synthetic image detectors (SIDs) are a key defense against the risks posed by the growing realism of images from text-to-image (T2I) models. Red teaming improves the effectiveness of SIDs by identifying and exploiting their failure modes via misclassified synthetic images. However, existing red-teaming solutions (i) require white-box access to SIDs, which is infeasible for proprietary state-of-the-art detectors, and (ii) generate image-specific attacks through expensive online optimization. To address these limitations, we propose PolyJuice, the first black-box, image-agnostic red-teaming method for SIDs, based on an observed distribution shift in the T2I latent space between samples correctly and incorrectly classified by the SID. PolyJuice generates attacks by (i) identifying the direction of this shift through a lightweight offline process that only requires black-box access to the SID, and (ii) exploiting this direction by universally steering all generated images towards the SID's failure modes. PolyJuice-steered T2I models are significantly more effective at deceiving SIDs (up to 84%) compared to their unsteered counterparts. We also show that the steering directions can be estimated efficiently at lower resolutions and transferred to higher resolutions using simple interpolation, reducing computational overhead. Finally, tuning SID models on PolyJuice-augmented datasets notably enhances the performance of the detectors (up to 30%).
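
The two steps described in the abstract (offline direction estimation from black-box labels, then universal steering) can be illustrated with a minimal NumPy sketch. This is not the paper's estimator: it substitutes a plain difference of class means for whatever procedure PolyJuice actually uses, and the function names `estimate_shift_direction` and `steer`, the latent shape, and the `strength` parameter are all hypothetical.

```python
import numpy as np


def estimate_shift_direction(latents, detected):
    """Estimate a unit-norm shift direction from black-box detector labels.

    ASSUMPTION: a difference of class means stands in for the paper's
    direction estimator, which the abstract does not specify.

    latents:  array of shape (N, ...), one T2I latent per generated image.
    detected: boolean array of shape (N,), True where the SID correctly
              flagged the image as synthetic (only its output is needed).
    """
    latents = np.asarray(latents, dtype=np.float64)
    detected = np.asarray(detected, dtype=bool)
    if detected.all() or not detected.any():
        raise ValueError("need both detected and missed samples")

    flat = latents.reshape(len(latents), -1)
    # Points from the mean detected latent towards the mean missed latent.
    diff = flat[~detected].mean(axis=0) - flat[detected].mean(axis=0)
    norm = np.linalg.norm(diff)
    if norm == 0.0:
        raise ValueError("class means coincide; no direction to estimate")
    return (diff / norm).reshape(latents.shape[1:])


def steer(latent, direction, strength):
    """Shift one latent (or a batch) along the direction.

    The same direction is reused for every image, so no per-image
    optimization is involved. `strength` is a hypothetical scale knob.
    """
    return np.asarray(latent, dtype=np.float64) + strength * direction
```

In use, one would generate a batch of latents, query the detector once per decoded image to obtain `detected`, call `estimate_shift_direction` offline, and then apply `steer` to every new latent before decoding.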
Problem

Research questions and friction points this paper is trying to address.

Black-box red teaming for synthetic image detectors
Universal image-agnostic attacks without online optimization
Exploiting distribution shift in T2I latent space
Innovation

Methods, ideas, or system contributions that make the work stand out.

Black-box image-agnostic red-teaming method
Lightweight offline process identifies distribution shift
Universal steering exploits failure modes efficiently
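
The abstract also states that steering directions estimated at a lower resolution can be transferred to a higher one "using simple interpolation". The sketch below shows one way that could look for a direction stored as a (C, H, W) array. The bilinear scheme with aligned corners and the renormalization step are assumptions, since the source does not name the interpolation mode, and `resize_direction` is a hypothetical helper.

```python
import numpy as np


def resize_direction(direction, out_hw):
    """Resize a (C, H, W) steering direction to (C, out_h, out_w).

    ASSUMPTIONS: separable linear (bilinear) interpolation with aligned
    corners, followed by renormalization to unit norm. The source only
    says "simple interpolation".
    """
    direction = np.asarray(direction, dtype=np.float64)
    c, h, w = direction.shape
    out_h, out_w = out_hw

    # Sample positions expressed in the source grid's index coordinates.
    ys = np.linspace(0.0, h - 1, out_h)
    xs = np.linspace(0.0, w - 1, out_w)

    # Pass 1: interpolate each row along the width axis.
    rows = np.empty((c, h, out_w))
    for ci in range(c):
        for yi in range(h):
            rows[ci, yi] = np.interp(xs, np.arange(w), direction[ci, yi])

    # Pass 2: interpolate each column along the height axis.
    out = np.empty((c, out_h, out_w))
    for ci in range(c):
        for xi in range(out_w):
            out[ci, :, xi] = np.interp(ys, np.arange(h), rows[ci, :, xi])

    norm = np.linalg.norm(out)
    if norm == 0.0:
        raise ValueError("direction is all zeros")
    return out / norm
```

Renormalizing keeps the meaning of a steering strength comparable across resolutions. Whether the original method rescales the direction this way is not stated in the source.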