🤖 AI Summary
Existing reference-based approaches for harmful content filtering suffer from poor scalability in large-scale settings and rely on full image generation, making them ill-suited for real-time applications. This work proposes a novel mechanism that embeds efficient reference matching directly within the denoising process. By applying an $x$-pred transformation, intermediate noisy latents are mapped to pseudo-clean latents, enabling early and low-latency content identification and blocking. The method requires no additional training and performs real-time reference comparison in the embedding space, achieving high accuracy and compatibility across diverse generative models. Experimental results demonstrate that the approach reduces processing time by approximately 79% on Z-Image-Turbo and 50% on Qwen-Image while maintaining strong filtering performance.
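The summary's $x$-pred transformation (noisy latent → pseudo-clean latent) can be sketched with the standard identity used in ε-prediction diffusion models: given the forward process $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, the model's noise estimate can be inverted to a pseudo-clean latent. This is a minimal illustration of that identity, not the paper's implementation; the function name `x_pred` and the scalar `alpha_bar_t` are illustrative assumptions.

```python
import numpy as np

def x_pred(x_t: np.ndarray, eps_pred: np.ndarray, alpha_bar_t: float) -> np.ndarray:
    """Invert the forward noising process: map a noisy latent x_t and the
    model's noise prediction to a pseudo-clean latent x0_hat.
    Assumes an epsilon-prediction parameterization (illustrative sketch)."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)

# Sanity check: if eps_pred equals the true noise, x0 is recovered exactly.
rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 8))
eps = rng.standard_normal((4, 8))
alpha_bar = 0.7
x_t = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
print(np.allclose(x_pred(x_t, eps, alpha_bar), x0))  # → True
```

Because this estimate is available at every denoising step, a filter can compare it against references well before generation finishes, which is what enables the early blocking described above.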
📝 Abstract
The advent of Text-to-Image generative models poses significant risks of copyright violation and deepfake generation. Because new copyrighted works and images of private individuals emerge constantly, reference-based, training-free content filters are essential for providing up-to-date protection without the constraints of a fixed knowledge cutoff. However, existing reference-based approaches often lack scalability when handling numerous references and must wait for image generation to finish. To solve these problems, we propose EDGE-Shield, a scalable content filter that operates during the denoising process, maintaining practical latency while effectively blocking violative content. We leverage embedding-based matching for efficient reference comparison. Additionally, we introduce an $x$-pred transformation that converts the model's noisy intermediate latent into a pseudo-estimate of the final clean latent, enhancing classification accuracy for violative content at earlier denoising stages. We conduct violative-content filtering experiments on two generative models, Z-Image-Turbo and Qwen-Image. EDGE-Shield significantly outperforms traditional reference-based methods in terms of latency, achieving an approximately $79\%$ reduction in processing time for Z-Image-Turbo and an approximately $50\%$ reduction for Qwen-Image, while maintaining filtering accuracy across different model architectures.
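The embedding-based reference matching described in the abstract can be illustrated as a cosine-similarity lookup against a bank of reference embeddings: normalize once, then each query reduces to a single matrix–vector product and a threshold test. This is a hedged sketch of the general technique; the function names, the threshold value, and the embedding model are assumptions, not details from the paper.

```python
import numpy as np

def build_reference_bank(ref_embeddings: list[np.ndarray]) -> np.ndarray:
    """Stack reference embeddings and L2-normalize each row, so that a
    dot product with a normalized query gives cosine similarity."""
    refs = np.stack(ref_embeddings).astype(np.float64)
    return refs / np.linalg.norm(refs, axis=1, keepdims=True)

def is_violative(query_emb: np.ndarray, ref_bank: np.ndarray,
                 threshold: float = 0.85) -> bool:
    """Flag the query if its best cosine similarity against any reference
    exceeds the threshold (0.85 is an illustrative choice, not the paper's)."""
    q = query_emb / np.linalg.norm(query_emb)
    sims = ref_bank @ q  # all cosine similarities in one matmul
    return bool(sims.max() >= threshold)

bank = build_reference_bank([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
print(is_violative(np.array([0.9, 0.1]), bank))                  # → True
print(is_violative(np.array([1.0, 1.0]), bank, threshold=0.95))  # → False
```

Precomputing the normalized bank is what keeps per-step cost low enough for the in-denoising checks the abstract describes: each check is O(references × embedding dim), independent of image resolution.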