🤖 AI Summary
Existing AI-generated image detection methods predominantly rely on a single low-level cue—such as noise patterns or frequency-domain anomalies—with simplistic fusion strategies, resulting in limited generalization across unseen generative architectures and training paradigms. To address this, we propose the Adaptive Low-level Expert Injection (ALEI) framework, introducing two key innovations: (1) a LoRA-based expert mechanism that enables parameter-efficient adaptation of low-level feature extractors, and (2) a cross-layer low-level information adapter that dynamically aligns heterogeneous low-level signals (e.g., noise residuals, spectral artifacts) with high-level semantic representations. ALEI employs dynamic feature selection and cross-attention–based fusion to adaptively inject discriminative low-level cues into intermediate transformer layers while preserving their fidelity. Trained exclusively on four ProGAN variants, ALEI achieves state-of-the-art performance across multiple benchmarks containing unseen GANs and diffusion models, significantly improving cross-architecture and cross-paradigm generalization.
📝 Abstract
Existing state-of-the-art AI-generated image detection methods mostly extract low-level information from RGB images, such as noise patterns, to improve generalization. However, these methods often consider only a single type of low-level information, which may lead to suboptimal generalization. Through empirical analysis, we discovered a key insight: different types of low-level information often generalize to different types of forgeries. Furthermore, we found that simple fusion strategies are insufficient to exploit the detection advantages of each type of low-level and high-level information across forgery types. Therefore, we propose the Adaptive Low-level Experts Injection (ALEI) framework. Our approach introduces LoRA Experts, enabling the backbone network, which is trained on high-level semantic RGB images, to accept and learn knowledge from different types of low-level information. We use cross-attention to adaptively fuse these features at intermediate layers. To prevent the backbone network from losing its ability to model different low-level features in the later stages of the network, we develop a Low-level Information Adapter that interacts with the features extracted by the backbone. Finally, we propose Dynamic Feature Selection, which dynamically selects the most suitable features for detecting the current image, maximizing generalization. Extensive experiments demonstrate that our method, fine-tuned on only four categories of mainstream ProGAN data, performs excellently and achieves state-of-the-art results on multiple datasets containing unseen GAN and diffusion methods.
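The injection mechanism described in the abstract—a frozen backbone augmented with a low-rank (LoRA) expert update, cross-attention that lets high-level tokens query a low-level stream, and a gate that weights expert branches per image—can be sketched in a few lines. This is a minimal NumPy illustration under assumed shapes and randomly initialized weights; all names (`lora_forward`, `cross_attention`, the gating step) are hypothetical and not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lora_forward(x, W, A, B, scale=1.0):
    # Frozen backbone weight W plus a trainable low-rank update A @ B
    # (the "LoRA Expert" for one type of low-level information).
    return x @ W + scale * (x @ A) @ B

def cross_attention(q_tokens, kv_tokens, Wq, Wk, Wv):
    # High-level backbone tokens act as queries; low-level tokens
    # (e.g., noise residuals) supply keys/values to be injected.
    Q, K, V = q_tokens @ Wq, kv_tokens @ Wk, kv_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return attn @ V

d, r = 16, 4                      # hidden size, LoRA rank (assumed)
W = rng.normal(size=(d, d))       # frozen backbone projection
A = rng.normal(size=(d, r))
B = np.zeros((r, d))              # zero-init: expert starts as a no-op

x_rgb   = rng.normal(size=(8, d))  # high-level semantic tokens
x_noise = rng.normal(size=(8, d))  # one low-level expert's tokens

h = lora_forward(x_rgb, W, A, B)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
injected = h + cross_attention(h, x_noise, Wq, Wk, Wv)

# Dynamic Feature Selection: a per-image gate over expert branches,
# e.g. [noise expert, frequency expert].
gate_logits = rng.normal(size=(2,))
gate_weights = softmax(gate_logits)
```

With `B` initialized to zero, `lora_forward` reduces to the frozen projection, so training only gradually shifts the backbone toward each low-level modality; the softmax gate mirrors the paper's idea of picking the most suitable features for the current image.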