🤖 AI Summary
To address the prevalent clickbait problem on YouTube—where video titles deliberately misrepresent actual content—this paper proposes a robust multimodal deep learning detection framework. The method integrates six heterogeneous feature modalities: title text, user comments, thumbnail images, tags, video statistics, and audio transcriptions. It introduces an ensemble of six modality-specific models whose outputs are averaged, ensuring stable discrimination even when some modalities are missing. By unifying natural language processing, computer vision, and automatic speech recognition techniques, the approach balances high accuracy with strong generalization. Evaluated on a real-world dataset of 1,400 YouTube videos, the system achieves a mean accuracy of 98% with inference latency ≤2 seconds per video—outperforming unimodal and mainstream multimodal baselines.
📝 Abstract
Following the rising popularity of YouTube, an emerging problem on the platform is clickbait, which provokes users to click on videos using attractive titles and thumbnails. As a result, users end up watching videos whose content does not match what the title publicizes. This study addresses the issue by proposing an algorithm called BaitRadar, a deep learning approach in which six inference models are jointly consulted to make the final classification decision. These models focus on different attributes of the video: title, comments, thumbnail, tags, video statistics, and audio transcript. The final classification is attained by averaging the outputs of the individual models, providing a robust and accurate result even in situations where data is missing. The proposed method is tested on 1,400 YouTube videos. On average, a test accuracy of 98% is achieved with an inference time of ≤ 2s.
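The decision rule described above—averaging the six modality models' outputs while tolerating missing data—can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the probability values, and the fixed 0.5 threshold are assumptions, and the actual per-modality models (text, vision, and speech networks) are not reproduced here.

```python
def ensemble_classify(modality_probs, threshold=0.5):
    """Average clickbait probabilities from the available modality models.

    modality_probs: dict mapping modality name -> clickbait probability
    in [0, 1], or None when that modality's data is missing for a video
    (e.g. comments disabled, no tags). Averaging only over the available
    models is what keeps the decision robust to missing data.
    """
    available = [p for p in modality_probs.values() if p is not None]
    if not available:
        raise ValueError("no modality produced a prediction")
    mean_prob = sum(available) / len(available)
    return mean_prob, mean_prob >= threshold

# Illustrative example: the thumbnail model is unavailable for this video,
# so the decision falls back to the remaining five modalities.
probs = {
    "title": 0.91,
    "comments": 0.85,
    "thumbnail": None,
    "tags": 0.78,
    "statistics": 0.66,
    "audio_transcript": 0.80,
}
score, is_clickbait = ensemble_classify(probs)
```

Here `score` is the mean of the five available probabilities (0.80), and the video is flagged as clickbait because it exceeds the threshold.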