AI Summary
The proliferation of AI-generated videos has triggered a severe trust crisis, yet existing detection methods suffer from small, low-quality datasets and rely on opaque binary classifiers that lack interpretability. To address this, we introduce GenBuster-200K, the first large-scale (200K-sample), high-fidelity dataset of real-world AI-generated videos. We further propose BusterX, the first interpretable detection framework integrating multimodal large language models (MLLMs) with reinforcement learning, moving beyond binary classification to support natural-language-based attribution and decision provenance. BusterX jointly models spatiotemporal video features, incorporates an explainable reasoning mechanism, and leverages high-fidelity synthetic data augmentation. Extensive experiments demonstrate that BusterX significantly outperforms state-of-the-art methods across multiple benchmarks, exhibiting strong generalization and robustness under diverse distribution shifts. The authors state that the code, models, and the GenBuster-200K dataset will be publicly released.
Abstract
Advances in generative AI models enable highly realistic video synthesis, amplifying misinformation risks on social media and eroding trust in digital content. Several works have explored deepfake detection methods for AI-generated images to mitigate these risks. However, despite the rapid development of video generation models such as Sora and WanX, large-scale, high-quality AI-generated video datasets for forgery detection are still lacking. In addition, existing detection approaches predominantly treat the task as binary classification; they offer no explanation of the model's decisions and fail to provide actionable insights or guidance for the public. To address these challenges, we propose GenBuster-200K, a large-scale AI-generated video dataset featuring 200K high-resolution video clips, a diverse set of recent generative techniques, and real-world scenes. We further introduce BusterX, a novel AI-generated video detection and explanation framework that leverages a multimodal large language model (MLLM) and reinforcement learning for authenticity determination and explainable rationales. To our knowledge, GenBuster-200K is the first large-scale, high-quality AI-generated video dataset that incorporates the latest generative techniques for real-world scenarios, and BusterX is the first framework to integrate an MLLM with reinforcement learning for explainable AI-generated video detection. Extensive comparisons with state-of-the-art methods and ablation studies validate the effectiveness and generalizability of BusterX. The code, models, and datasets will be released.