BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

📅 2025-07-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing unimodal detection methods struggle to identify AI-generated content that spans both images and videos. To address this, we propose BusterX++, a unified cross-modal detection and explanation framework built on a multimodal large language model (MLLM). BusterX++ uses a reinforcement learning (RL) post-training strategy that eliminates cold-start and combines three components: (1) multi-stage training, (2) a "thinking reward" that guides reasoning, and (3) hybrid reasoning. Together, these yield stable and substantial performance improvements. We also present GenBuster++, a cross-modal benchmark of 4,000 images and video clips generated with state-of-the-art techniques and curated by human experts using a new filtering methodology. Experiments demonstrate the effectiveness and generalizability of the approach.

📝 Abstract

Recent advances in generative AI have dramatically improved image and video synthesis capabilities, significantly increasing the risk of misinformation through sophisticated fake content. In response, detection methods have evolved from traditional approaches to multimodal large language models (MLLMs), offering enhanced transparency and interpretability in identifying synthetic media. However, current detection systems remain fundamentally limited by their single-modality design. These approaches analyze images or videos separately, making them ineffective against synthetic content that combines multiple media formats. To address these challenges, we introduce BusterX++, a novel framework designed specifically for cross-modal detection and explanation of synthetic media. Our approach incorporates an advanced reinforcement learning (RL) post-training strategy that eliminates cold-start. Through Multi-stage Training, Thinking Reward, and Hybrid Reasoning, BusterX++ achieves stable and substantial performance improvements. To enable comprehensive evaluation, we also present GenBuster++, a cross-modal benchmark leveraging state-of-the-art image and video generation techniques. This benchmark comprises 4,000 images and video clips, meticulously curated by human experts using a novel filtering methodology to ensure high quality, diversity, and real-world applicability. Extensive experiments demonstrate the effectiveness and generalizability of our approach.

Problem

Research questions and friction points this paper is trying to address.

Detect synthetic media across multiple modalities effectively

Overcome limitations of single-modality detection systems

Provide transparent and interpretable AI-generated content explanations

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-modal detection with MLLM

Reinforcement learning post-training strategy

Multi-stage training and hybrid reasoning

🔎 Similar Papers

Detecting Multimedia Generated by Large AI Models: A Survey

2024-01-22, arXiv.org, Citations: 53