Mapping Toxic Comments Across Demographics: A Dataset from German Public Broadcasting

📅 2025-08-26

🤖 AI Summary

Existing toxic speech datasets largely lack fine-grained demographic annotations, particularly age, which hinders analysis of how linguistic behavior differs across generations. To address this, we introduce the first large-scale German toxic comment dataset enriched with age information, drawn from Instagram, TikTok, and YouTube. The methodology combines platform-provided age estimates with human and LLM annotation (3,024 human-annotated and 30,024 LLM-annotated comments), covering categories including insults, disinformation, and criticism of public broadcasting fees. The analysis reveals age-based patterns: younger users favor expressive language, whereas older users more often engage in disinformation and devaluation. Because comments were preselected using predefined toxic keywords, 16.7% of them are labeled as problematic. The dataset supports research on linguistic variation across demographics and the development of more equitable, age-aware content moderation systems.

📝 Abstract

A lack of demographic context in existing toxic speech datasets limits our understanding of how different age groups communicate online. In collaboration with funk, a German public service content network, this research introduces the first large-scale German dataset annotated for toxicity and enriched with platform-provided age estimates. The dataset includes 3,024 human-annotated and 30,024 LLM-annotated anonymized comments from Instagram, TikTok, and YouTube. To ensure relevance, comments were consolidated using predefined toxic keywords, resulting in 16.7% labeled as problematic. The annotation pipeline combined human expertise with state-of-the-art language models, identifying key categories such as insults, disinformation, and criticism of broadcasting fees. The dataset reveals age-based differences in toxic speech patterns, with younger users favoring expressive language and older users more often engaging in disinformation and devaluation. This resource provides new opportunities for studying linguistic variation across demographics and supports the development of more equitable and age-aware content moderation systems.
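The abstract describes two steps that a small example can make concrete: preselecting comments with predefined keywords, then comparing the share of problematic labels across age groups. The Python sketch below is only an illustration under stated assumptions. The keyword list, the age-group buckets, the field names (`text`, `age_group`, `problematic`), and the sample comments are all hypothetical and are not taken from the dataset or the authors' pipeline.

```python
from collections import Counter

# Hypothetical keyword list; the paper's actual keywords are not given here.
KEYWORDS = {"idiot", "liar"}


def matches_keyword(text, keywords=KEYWORDS):
    """Return True if any whitespace-separated token equals a keyword."""
    tokens = {tok.strip(".,!?").lower() for tok in text.split()}
    return bool(tokens & keywords)


def problematic_share_by_age(records):
    """Share of records labeled problematic within each age group."""
    totals = Counter()
    flagged = Counter()
    for rec in records:
        totals[rec["age_group"]] += 1
        if rec["problematic"]:
            flagged[rec["age_group"]] += 1
    return {group: flagged[group] / totals[group] for group in totals}


# Hypothetical, hand-written comments with made-up labels and age buckets.
comments = [
    {"text": "You are an idiot!", "age_group": "18-24", "problematic": True},
    {"text": "Nice video, thanks", "age_group": "18-24", "problematic": False},
    {"text": "What a liar.", "age_group": "45-54", "problematic": True},
    {"text": "He called me a liar, so what?", "age_group": "45-54", "problematic": False},
]

# Step 1: keep only comments that contain a keyword.
candidates = [c for c in comments if matches_keyword(c["text"])]

# Step 2: compare label shares across age groups among the candidates.
shares = problematic_share_by_age(candidates)
print(shares)
```

The fourth sample comment contains a keyword but is not labeled problematic, which shows why keyword preselection alone does not decide the label and a separate annotation step is still needed.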

Problem

Research questions and friction points this paper is trying to address.

Existing toxic speech datasets lack demographic context

Age-based differences in online communication need to be studied

More equitable, age-aware content moderation systems need to be developed

Innovation

Methods, ideas, or system contributions that make the work stand out.

First large-scale German dataset with toxicity and age annotations

Combined human expertise with state-of-the-art language models

Revealed age-based differences in toxic speech patterns
