Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach

📅 2025-12-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Automated detection of harmful social media content (hate speech, rumors, and extremist text) is vulnerable to adversarial textual perturbations, which inflate false negatives and hurt generalization. To address this, the paper proposes LLM-SGA-ARHOCD: a framework that first leverages large language models to generate and aggregate diverse adversarial samples (LLM-SGA), broadening attack coverage, and then instantiates an Adversarially Robust Harmful Online Content Detector (ARHOCD) that combines multi-base-model ensembling, Bayesian dynamic weighting, and domain-knowledge-guided collaborative adversarial training. Evaluated on three real-world datasets, the method yields significant gains in adversarial robustness (+12.7% on average) and clean-sample accuracy (+3.4% on average), along with strong cross-attack generalization and high precision. This work establishes a scalable, robust paradigm for secure online content moderation.

📝 Abstract
Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning (ML) models are widely adopted to detect such content; however, they remain highly vulnerable to adversarial attacks, wherein malicious users subtly modify text to evade detection. Enhancing adversarial robustness is therefore essential, requiring detectors that can defend against diverse attacks (generalizability) while maintaining high overall accuracy. However, simultaneously achieving both optimal generalizability and accuracy is challenging. Following the computational design science paradigm, this study takes a sequential approach that first proposes a novel framework (Large Language Model-based Sample Generation and Aggregation, LLM-SGA) by identifying the key invariances of textual adversarial attacks and leveraging them to ensure that a detector instantiated within the framework has strong generalizability. Second, we instantiate our detector (Adversarially Robust Harmful Online Content Detector, ARHOCD) with three novel design components to improve detection accuracy: (1) an ensemble of multiple base detectors that exploits their complementary strengths; (2) a novel weight assignment method that dynamically adjusts weights based on each sample's predictability and each base detector's capability, with weights initialized using domain knowledge and updated via Bayesian inference; and (3) a novel adversarial training strategy that iteratively optimizes both the base detectors and the weight assignor. We addressed several limitations of existing adversarial robustness enhancement research and empirically evaluated ARHOCD across three datasets spanning hate speech, rumor, and extremist content. Results show that ARHOCD offers strong generalizability and improves detection accuracy under adversarial conditions.
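As a rough illustration of the Bayesian dynamic weighting idea described in the abstract, each base detector can carry a Beta posterior over its reliability, initialized from domain knowledge and updated as labeled samples arrive. This is a minimal sketch under a Beta-Bernoulli assumption, not the paper's actual ARHOCD design; the class and parameter names are illustrative:

```python
import numpy as np

class BayesianWeightedEnsemble:
    """Sketch: weight base detectors by a Beta posterior over their
    accuracy (illustrative only, not the paper's exact ARHOCD method)."""

    def __init__(self, detectors, prior=(2.0, 1.0)):
        # `detectors`: callables mapping text -> P(harmful). The prior
        # encodes domain knowledge (alpha > beta = believed reliable).
        self.detectors = detectors
        self.alpha = np.full(len(detectors), prior[0])
        self.beta = np.full(len(detectors), prior[1])

    def weights(self):
        # Posterior mean reliability of each detector, normalized.
        w = self.alpha / (self.alpha + self.beta)
        return w / w.sum()

    def predict_proba(self, text):
        scores = np.array([d(text) for d in self.detectors])
        return float(self.weights() @ scores)

    def update(self, text, label):
        # Bayesian update: a correct call on this sample counts as a
        # "success" for that detector, an incorrect one as a "failure".
        for i, d in enumerate(self.detectors):
            correct = (d(text) >= 0.5) == bool(label)
            self.alpha[i] += correct
            self.beta[i] += not correct
```

With a reliable and an unreliable detector, a few labeled updates shift the ensemble weight toward the reliable one, which is the dynamic-adjustment behavior the abstract describes.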
Problem

Research questions and friction points this paper is trying to address.

Detect harmful online content robustly against adversarial attacks
Achieve both high generalizability and accuracy in detection
Overcome limitations of existing adversarial robustness enhancement methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages LLM-based sample generation for adversarial robustness
Uses ensemble with dynamic Bayesian weight assignment for accuracy
Implements iterative adversarial training optimizing detectors and weights
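The third innovation, iterative adversarial training, can be sketched as an alternating loop: augment the data with adversarial variants, retrain the base detectors, then refresh the weight assignor. The `perturb` function here is a toy character-swap stand-in for the paper's LLM-generated attacks, and all function names are hypothetical:

```python
import random

def perturb(text, rng):
    """Toy adjacent-character swap standing in for LLM-generated
    adversarial samples (the paper's LLM-SGA is far richer)."""
    chars = list(text)
    if len(chars) >= 2:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def adversarial_training_loop(train_fn, reweight_fn, data, rounds=3, seed=0):
    """Alternate (1) hardening base detectors on adversarially augmented
    data and (2) re-fitting the ensemble's weight assignor."""
    rng = random.Random(seed)
    augmented = data
    for _ in range(rounds):
        # Fresh adversarial variants each round (not accumulated here).
        augmented = data + [(perturb(x, rng), y) for x, y in data]
        train_fn(augmented)      # step 1: retrain base detectors
        reweight_fn(augmented)   # step 2: update the weight assignor
    return augmented
```

The alternation mirrors the abstract's claim that both the base detectors and the weight assignor are optimized iteratively, rather than training one against a frozen copy of the other.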
Yidong Chai
Hefei University of Technology
Yi Liu
Hefei University of Technology
Mohammadreza Ebrahimi
University of South Florida
Weifeng Li
University of Georgia
Balaji Padmanabhan
Associate Dean & Director, Center for Artificial Intelligence in Business
Machine Learning · Artificial Intelligence · Business Applications · Analytics & Creativity