ModelCitizens: Representing Community Voices in Online Safety

📅 2025-07-07

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: Existing toxicity detection models are typically trained on multi-annotator labels that have been collapsed into a single ground-truth label. This discards community norms, conversational context, and context-specific uses of language (e.g., reclaimed language), so the models miss how socially situated toxicity judgments on social media are.

Method: The authors propose a community-aware approach: (1) curating MODELCITIZENS, a dataset of 6.8K social media posts with 40K toxicity annotations across diverse identity groups; (2) augmenting posts with LLM-generated conversational scenarios to capture the effect of context on toxicity; and (3) fine-tuning LLaMA- and Gemma-based models on the dataset, released as LLAMACITIZEN-8B and GEMMACITIZEN-12B.

Contribution/Results: Existing tools such as the OpenAI Moderation API and GPT-o4-mini underperform on MODELCITIZENS, and degrade further on context-augmented posts. On in-distribution evaluations, the fine-tuned models outperform GPT-o4-mini by 5.5%. The work argues that annotation diversity, conversational context, and community norms should be addressed together for inclusive content moderation.

📝 Abstract

Automatic toxic language detection is critical for creating safe, inclusive online spaces. However, it is a highly subjective task, with perceptions of toxic language shaped by community norms and lived experience. Existing toxicity detection models are typically trained on annotations that collapse diverse annotator perspectives into a single ground truth, erasing important context-specific notions of toxicity such as reclaimed language. To address this, we introduce MODELCITIZENS, a dataset of 6.8K social media posts and 40K toxicity annotations across diverse identity groups. To capture the role of conversational context on toxicity, typical of social media posts, we augment MODELCITIZENS posts with LLM-generated conversational scenarios. State-of-the-art toxicity detection tools (e.g. OpenAI Moderation API, GPT-o4-mini) underperform on MODELCITIZENS, with further degradation on context-augmented posts. Finally, we release LLAMACITIZEN-8B and GEMMACITIZEN-12B, LLaMA- and Gemma-based models finetuned on MODELCITIZENS, which outperform GPT-o4-mini by 5.5% on in-distribution evaluations. Our findings highlight the importance of community-informed annotation and modeling for inclusive content moderation.
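To make the "single ground truth" problem concrete, here is a minimal sketch of how majority-vote aggregation can erase an in-group reading of a post. Everything in it is an illustrative assumption: the annotations are invented, the `ingroup`/`outgroup` split and the label names are hypothetical, and the code does not reflect the actual MODELCITIZENS schema or the authors' aggregation procedure.

```python
from collections import Counter, defaultdict

# Hypothetical annotations for ONE post: (annotator_group, label).
# "ingroup" stands for annotators who share the identity the post refers to.
# These values are invented for illustration, not taken from the dataset.
annotations = [
    ("ingroup", "not_toxic"),
    ("ingroup", "not_toxic"),
    ("outgroup", "toxic"),
    ("outgroup", "toxic"),
    ("outgroup", "toxic"),
]


def majority_label(labels):
    """Return the most common label; ties resolve to 'toxic' (an arbitrary choice)."""
    counts = Counter(labels)
    top = max(counts.values())
    winners = sorted(label for label, count in counts.items() if count == top)
    return "toxic" if "toxic" in winners else winners[0]


def collapsed_label(annotations):
    """Collapse every annotation into one label, ignoring who produced it."""
    return majority_label([label for _, label in annotations])


def per_group_labels(annotations):
    """Keep one majority label per annotator group instead of a single label."""
    by_group = defaultdict(list)
    for group, label in annotations:
        by_group[group].append(label)
    return {group: majority_label(labels) for group, labels in by_group.items()}


print(collapsed_label(annotations))
print(per_group_labels(annotations))
```

In this toy case the collapsed label is `toxic`, while the per-group view keeps the disagreement visible: the in-group annotators judged the post `not_toxic`. A post containing reclaimed language is the kind of case where such a split might arise.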

Problem

Research questions and friction points this paper is trying to address.

Addressing subjectivity in toxic language detection models

Incorporating diverse community norms into toxicity annotations

Improving toxicity detection for context-rich social media posts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Dataset with diverse identity group annotations

LLM-generated conversational context augmentation

Community-informed LLaMA and Gemma models
