🤖 AI Summary
To address the proliferation of bot accounts on Twitter and their role in spreading misinformation, this paper proposes BotArtist, a general-purpose bot detection model aimed at the post-Twitter-API era. Methodologically, the authors introduce a Semi-Automatic Machine Learning Pipeline (SAMLP) that draws on nine publicly available datasets to train a supervised model based on user profile features. As a further contribution, they release a large-scale, real-world dataset collected over 16 months during the Russo-Ukrainian War that began in 2022, comprising feature vectors and BotArtist predictions for about 10.93 million user profiles, to be combined with an existing corpus of roughly 127 million anonymized tweets. In a comparison against 35 existing Twitter bot detection methods across the nine public datasets, BotArtist achieves average F1-scores of 83.19 and 68.5 for the specific and general approaches, respectively, outperforming the existing methods by almost 10%.
📝 Abstract
Twitter, one of the most popular social networks, provides a platform for communication and online discourse. Unfortunately, it has also become a target for bots and fake accounts, which spread false information and manipulate discussion. This paper introduces a semi-automatic machine learning pipeline (SAMLP) designed to address the challenges associated with machine learning model development. Using this pipeline, we develop a comprehensive bot detection model named BotArtist, based on user profile features. SAMLP leverages nine distinct publicly available datasets to train the BotArtist model. To assess BotArtist's performance against current state-of-the-art solutions, we select 35 existing Twitter bot detection methods, each using a different range of features. Our comparative evaluation of BotArtist and these existing methods, conducted across the nine public datasets under standardized conditions, shows that the proposed model outperforms existing solutions by almost 10% in terms of F1-score, achieving average scores of 83.19 and 68.5 for the specific and general approaches, respectively. As a result of this research, we provide a dataset of the extracted features combined with BotArtist predictions for 10,929,533 Twitter user profiles, collected via the Twitter API over a 16-month period during the Russo-Ukrainian War that began in 2022. This dataset was created in collaboration with [Shevtsov et al., 2022a], whose authors share 127,275,386 anonymized tweets discussing the Russo-Ukrainian War. Combining the existing text dataset with the provided labeled bot and human profiles will allow for the future development of a more advanced bot detection large language model in the post-Twitter API era.
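The abstract reports F1-scores averaged across several datasets. As a rough illustration of how such a figure can be computed, the sketch below calculates a per-dataset F1-score for the bot class and then takes an unweighted mean. The unweighted averaging, the label encoding (1 = bot, 0 = human), the function names, and the toy data are all assumptions made for illustration; the abstract does not specify the exact averaging scheme used in the paper.

```python
from statistics import mean


def f1_score(y_true, y_pred, positive=1):
    """F1-score for the positive class (assumed here to be 1 = bot)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(per_dataset):
    """Return per-dataset F1-scores and their unweighted mean.

    per_dataset maps a dataset name to a (y_true, y_pred) pair.
    """
    scores = {name: f1_score(t, p) for name, (t, p) in per_dataset.items()}
    return scores, mean(scores.values())


# Hypothetical toy labels and predictions, not taken from the paper.
toy = {
    "dataset_a": ([1, 1, 0, 0], [1, 0, 0, 0]),
    "dataset_b": ([1, 0, 1, 0], [1, 0, 1, 1]),
}
scores, macro = average_f1(toy)
print(scores, macro)
```

Averaging per-dataset scores without weighting treats small and large datasets equally, which is one common way to summarize cross-dataset performance; a size-weighted mean would be a reasonable alternative.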