🤖 AI Summary
Existing unimodal detection methods struggle to identify AI-generated content across image and video modalities. To address this, we propose BusterX++, a unified cross-modal detection and explanation framework based on a multimodal large language model (MLLM). BusterX++ is trained with a reinforcement learning (RL) post-training strategy that eliminates cold-start, and it combines three components: (1) multi-stage training, (2) a “thinking reward” that guides reasoning, and (3) hybrid reasoning. Together these yield stable and substantial performance improvements. We further construct GenBuster++, a cross-modal benchmark of 4,000 images and video clips curated by human experts using a new filtering method. Experiments demonstrate the effectiveness and generalizability of the approach.
📝 Abstract
Recent advances in generative AI have dramatically improved image and video synthesis capabilities, significantly increasing the risk of misinformation through sophisticated fake content. In response, detection methods have evolved from traditional approaches to multimodal large language models (MLLMs), offering enhanced transparency and interpretability in identifying synthetic media. However, current detection systems remain fundamentally limited by their single-modality design. These approaches analyze images or videos separately, making them ineffective against synthetic content that combines multiple media formats. To address these challenges, we introduce BusterX++, a novel framework designed specifically for cross-modal detection and explanation of synthetic media. Our approach incorporates an advanced reinforcement learning (RL) post-training strategy that eliminates cold-start. Through Multi-stage Training, Thinking Reward, and Hybrid Reasoning, BusterX++ achieves stable and substantial performance improvements. To enable comprehensive evaluation, we also present GenBuster++, a cross-modal benchmark leveraging state-of-the-art image and video generation techniques. This benchmark comprises 4,000 images and video clips, meticulously curated by human experts using a novel filtering methodology to ensure high quality, diversity, and real-world applicability. Extensive experiments demonstrate the effectiveness and generalizability of our approach.