🤖 AI Summary
This work addresses the limited generalization of existing AI-generated image detection methods to unseen generative models by proposing an incremental learning framework that integrates dual-path spectral analysis with retrieval augmentation. The approach employs four-band Fourier decomposition to extract frequency-domain features, combines a partially frozen ViT-L/14 encoder with Kolmogorov–Arnold Network (KAN)-based mixture-of-experts to model band-specific characteristics, and incorporates elastic weight consolidation for continual learning. It pairs spectral consistency priors with retrieval, querying a Milvus vector database of labeled signatures so that predictions stay robust on unseen generators. Evaluated on the UniversalFakeDetect benchmark encompassing 19 generative models, the method achieves a mean accuracy of 94.6%, substantially outperforming current state-of-the-art techniques.
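As a rough illustration of the four-band Fourier decomposition mentioned above, the following NumPy sketch splits a single-channel image into radial frequency bands. The function name `split_into_bands`, the equal-width band edges, and the hard (binary) masks are assumptions made for this example; the source does not state how the band boundaries are chosen.

```python
import numpy as np


def split_into_bands(image, num_bands=4):
    """Split a 2-D array into radial frequency bands (hypothetical helper).

    Band edges are equally spaced in normalized radial frequency. This is an
    assumption for illustration; the source does not specify the boundaries.
    """
    h, w = image.shape
    # Centered 2-D spectrum of the image.
    spectrum = np.fft.fftshift(np.fft.fft2(image))

    # Radial frequency of every spectral coefficient (cycles per sample).
    fy = np.fft.fftshift(np.fft.fftfreq(h))
    fx = np.fft.fftshift(np.fft.fftfreq(w))
    radius = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)

    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    bands = []
    for i in range(num_bands):
        lo, hi = edges[i], edges[i + 1]
        if i == num_bands - 1:
            mask = (radius >= lo) & (radius <= hi)  # include the outermost ring
        else:
            mask = (radius >= lo) & (radius < hi)
        # Back to the spatial domain; the mask is radially symmetric, so the
        # imaginary part is numerical noise and can be dropped.
        band = np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real
        bands.append(band)
    return bands
```

Because the masks partition the spectrum, the bands sum back to the original image, which makes a convenient sanity check. In a pipeline like the one summarized here, each band would then be passed to its own band-specific transformation.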
📝 Abstract
Detecting AI-generated images remains a significant challenge because detectors trained on specific generators often fail to generalize to unseen models. Although pixel-level artifacts vary across models, frequency-domain signatures are more consistent, providing a promising foundation for cross-generator detection. To exploit this, we propose SPARK-IL, a retrieval-augmented framework that combines dual-path spectral analysis with incremental learning. A partially frozen ViT-L/14 encoder supplies semantic representations, while a parallel path embeds raw RGB pixels. Both paths undergo multi-band Fourier decomposition into four frequency bands, and each band is processed by Kolmogorov-Arnold Networks (KAN) with mixture-of-experts for band-specific transformations. The resulting spectral embeddings are fused via cross-attention with residual connections. During inference, the fused embedding retrieves the $k$ nearest labeled signatures from a Milvus database using cosine similarity, and the prediction is made by majority voting over their labels. An incremental learning strategy expands the database and employs elastic weight consolidation to preserve previously learned transformations. Evaluated on the UniversalFakeDetect benchmark across 19 generative models (including GANs, face-swapping, and diffusion methods), SPARK-IL achieves a mean accuracy of 94.6%. Code will be publicly released at https://github.com/HessenUPHF/SPARK-IL.
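The retrieval step described in the abstract (cosine-similarity search for the $k$ nearest labeled signatures, followed by majority voting) can be sketched in a few lines. This is a minimal in-memory illustration, not the authors' implementation: a NumPy array stands in for the Milvus collection, and the function name `knn_majority_vote`, the default `k=5`, and the tie-breaking rule are assumptions.

```python
from collections import Counter

import numpy as np


def knn_majority_vote(query, signatures, labels, k=5):
    """Label a query embedding by majority vote over its k nearest signatures.

    `signatures` is an (n, d) array of stored embeddings and `labels` a
    sequence of n labels. An in-memory array replaces the vector database
    here; this is an assumption made purely for illustration.
    """
    # Cosine similarity equals the dot product of L2-normalized vectors.
    q = query / np.linalg.norm(query)
    s = signatures / np.linalg.norm(signatures, axis=1, keepdims=True)
    sims = s @ q

    # Indices of the k most similar stored signatures, most similar first.
    top = np.argsort(-sims)[:k]

    # Majority vote; on a tie, the label seen first (nearest neighbor) wins.
    votes = Counter(labels[i] for i in top)
    return votes.most_common(1)[0][0]
```

With two classes (real vs. generated), choosing an odd `k` avoids ties altogether. Under this scheme, supporting a new generator amounts to adding its labeled signatures to the store, which matches the database-expansion part of the incremental strategy described in the abstract.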