🤖 AI Summary
YouTube scam video detection suffers from the limitations of unimodal approaches, such as susceptibility to evasion and neglect of visual cues. To address this, we propose the first explainable, policy-aware multimodal reasoning framework that jointly leverages titles, descriptions, audio transcripts, and keyframes, while explicitly incorporating platform content policies. Our method employs a fine-tuned BERT for textual encoding, LLaVA-Video for visual semantic modeling, and a strategy-guided cross-modal alignment mechanism that embeds policy constraints directly into the detection logic. Evaluated on a real-world YouTube dataset, our approach achieves an F1-score of 80.53%, substantially outperforming unimodal baselines. Furthermore, we publicly release the first large-scale, human-annotated YouTube scam video dataset, comprising 6,374 videos, to advance explainable, compliance-driven multimodal content safety research.
📝 Abstract
YouTube has emerged as a dominant platform for both information dissemination and entertainment. However, its vast accessibility has also made it a target for scammers, who frequently upload deceptive or malicious content. Prior research has documented a range of scam types, and existing detection approaches rely primarily on textual or statistical metadata. Although effective to some extent, these signals are easy to evade, and such approaches overlook other modalities, such as visual cues.
In this study, we present the first systematic investigation of multimodal approaches for YouTube scam detection. Our dataset consolidates established scam categories and augments them with full-length video content and policy-grounded reasoning annotations. Our experimental evaluation demonstrates that a text-only model using video titles and descriptions (fine-tuned BERT) achieves moderate effectiveness (76.61% F1), with modest improvements when incorporating audio transcripts (77.98% F1). In contrast, visual analysis using a fine-tuned LLaVA-Video model yields stronger results (79.61% F1). Finally, a multimodal framework that integrates titles, descriptions, and video frames achieves the highest performance (80.53% F1). Beyond improving detection accuracy, our multimodal framework produces interpretable reasoning grounded in YouTube content policies, thereby enhancing transparency and supporting potential applications in automated moderation. Moreover, we validate our approach on in-the-wild YouTube data by analyzing 6,374 videos, thereby contributing a valuable resource for future research on scam detection.
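To make the multimodal setup concrete, the sketch below shows one simple way modality-specific classifier outputs could be combined via weighted late fusion. This is only an illustrative sketch: the function name, weights, and threshold are assumptions for demonstration, and the paper's actual strategy-guided cross-modal alignment mechanism is more involved than a weighted average.

```python
# Hypothetical late-fusion sketch for multimodal scam detection.
# Assumes each modality (title/description text, audio transcript,
# video frames) has already produced a scam probability in [0, 1].

def fuse_scam_scores(text_prob, transcript_prob, visual_prob,
                     weights=(0.3, 0.2, 0.5), threshold=0.5):
    """Return (is_scam, fused_probability) from per-modality scores.

    The weights give the visual model the largest vote, loosely
    mirroring the abstract's finding that visual analysis was the
    strongest single modality; the exact values are illustrative.
    """
    probs = (text_prob, transcript_prob, visual_prob)
    total = sum(weights)
    fused = sum(w * p for w, p in zip(weights, probs)) / total
    return fused >= threshold, fused

# Example: the visual model is confident, text models are borderline.
flag, score = fuse_scam_scores(0.55, 0.48, 0.90)
```

In this toy configuration, a confident visual signal can push a borderline textual signal over the decision threshold, which is the basic motivation for combining modalities rather than relying on text alone.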