Multimodal Conditional Information Bottleneck for Generalizable AI-Generated Image Detection

📅 2025-05-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing CLIP-based AI-generated image detection methods suffer from insufficient cross-model and cross-training-strategy generalization due to visual-textual feature redundancy. To address this, we propose a Text-Guided Conditional Information Bottleneck (TGCIB) and Dynamic Text Orthogonalization (DTO) framework. TGCIB employs a semantic-aware conditional information bottleneck to compress redundant visual features under text guidance, while DTO is the first method to explicitly model the semantic deviation between real and fake images in CLIP’s text embedding space and enforce dynamic orthogonality constraints to enhance discriminability. Our approach deeply integrates CLIP’s multimodal representations and introduces a dynamically weighted text feature updating mechanism. Evaluated on GenImage and diverse state-of-the-art generative models, the method achieves new SOTA generalization performance, significantly improving detection robustness across architectural variants and training paradigms.
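
The compression step described above builds on the information bottleneck idea. As a rough illustration only, the sketch below computes a generic variational information bottleneck loss: a classification term plus a beta-weighted KL penalty on a Gaussian latent code. It is not the paper's TGCIB, because the text and class conditioning is not modeled. The function names, the diagonal-Gaussian encoder, and the `beta` value are all assumptions introduced here.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def binary_cross_entropy(p_fake, label):
    """Cross-entropy for one sample; label is 1 for fake, 0 for real."""
    eps = 1e-12  # clamp to avoid log(0)
    p = min(max(p_fake, eps), 1.0 - eps)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def vib_loss(mu, log_var, p_fake, label, beta=1e-3):
    """Generic VIB objective: prediction loss + beta * compression penalty.

    mu, log_var: parameters of the (hypothetical) encoder's latent Gaussian.
    p_fake: the classifier's predicted probability that the image is fake.
    beta: assumed trade-off weight between accuracy and compression.
    """
    return binary_cross_entropy(p_fake, label) + beta * kl_to_standard_normal(
        mu, log_var
    )


# Toy usage: a latent that already matches the prior pays no KL penalty,
# so the loss reduces to the cross-entropy of an uninformative prediction.
loss = vib_loss(mu=[0.0, 0.0], log_var=[0.0, 0.0], p_fake=0.5, label=1, beta=0.1)
```

A larger `beta` pushes the latent code toward the prior and discards more of the input, which is the redundancy-versus-discriminability trade-off the summary refers to.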

📝 Abstract

Although existing CLIP-based methods for detecting AI-generated images have achieved promising results, they are still limited by severe feature redundancy, which hinders their generalization ability. To address this issue, incorporating an information bottleneck network into the task presents a straightforward solution. However, relying solely on image-corresponding prompts results in suboptimal performance due to the inherent diversity of prompts. In this paper, we propose a multimodal conditional bottleneck network to reduce feature redundancy while enhancing the discriminative power of features extracted by CLIP, thereby improving the model's generalization ability. We begin with a semantic analysis experiment, where we observe that arbitrary text features exhibit lower cosine similarity with real image features than with fake image features in the CLIP feature space, a phenomenon we refer to as "bias". Therefore, we introduce InfoFD, a text-guided AI-generated image detection framework. InfoFD consists of two key components: the Text-Guided Conditional Information Bottleneck (TGCIB) and Dynamic Text Orthogonalization (DTO). TGCIB improves the generalizability of learned representations by conditioning on both text and class modalities. DTO dynamically updates weighted text features, preserving semantic information while leveraging the global "bias". Our model achieves exceptional generalization performance on the GenImage dataset and on the latest generative models. Our code is available at https://github.com/Ant0ny44/InfoFD.
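
The "bias" observation in the abstract is a statement about average cosine similarities, which can be checked with a few lines of code. The sketch below is a minimal illustration under stated assumptions: the feature vectors are toy 2-D lists rather than real CLIP embeddings, and `similarity_gap` is a hypothetical helper name, not a function from the InfoFD repository.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_similarity(text_feats, image_feats):
    """Average cosine similarity over every (text, image) pair."""
    sims = [cosine(t, i) for t in text_feats for i in image_feats]
    return sum(sims) / len(sims)


def similarity_gap(text_feats, real_feats, fake_feats):
    """Mean text-to-fake similarity minus mean text-to-real similarity.

    A positive value matches the direction reported in the abstract:
    text features sit closer to fake-image features than to real ones.
    """
    return mean_similarity(text_feats, fake_feats) - mean_similarity(
        text_feats, real_feats
    )


# Toy stand-ins for CLIP features (assumed values, for illustration only).
text = [[1.0, 0.0]]
real = [[0.0, 1.0]]
fake = [[1.0, 1.0]]
gap = similarity_gap(text, real, fake)  # positive in this toy setup
```

With real data, the three lists would hold embeddings from CLIP's text and image encoders, and the gap would be averaged over many prompts and images.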

Problem

Research questions and friction points this paper is trying to address.

Reducing feature redundancy in AI-generated image detection

Enhancing discriminative power of CLIP-extracted features

Improving generalization with multimodal conditional bottleneck network

Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal conditional bottleneck network reduces redundancy

Text-guided framework enhances discriminative feature power

Dynamic text orthogonalization leverages global bias
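
The page does not spell out how DTO enforces orthogonality, so the following is only a generic sketch of the underlying operation: removing from a vector its component along a given direction (a single Gram-Schmidt step). The names `orthogonalize`, `text_feature`, and `direction` are assumptions made for illustration and should not be read as the paper's formulation.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(text_feature, direction):
    """Return text_feature with its projection onto direction removed.

    The result is orthogonal to direction (up to floating-point error).
    direction must be non-zero.
    """
    scale = dot(text_feature, direction) / dot(direction, direction)
    return [t - scale * d for t, d in zip(text_feature, direction)]


# Toy example with assumed values.
text_feature = [3.0, 4.0]
direction = [1.0, 0.0]
residual = orthogonalize(text_feature, direction)  # [0.0, 4.0]
```

In a training loop this projection would be recomputed as the features change, which is one plausible reading of "dynamic" here.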
