Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the implicit decision-making mechanisms underlying cross-national content moderation. Addressing the limited interpretability of existing NLP moderation models and the unclear origins of cross-national output disparities, we propose the first integrated framework combining inverse modeling and explainability analysis: (1) reverse-inferring country-specific moderation behaviors via multilingual classifiers and Shapley value attribution; and (2) generating structured, human-interpretable explanations using multiple LLMs (LLaMA, GPT, Claude), validated through human-in-the-loop evaluation. Experiments on Twitter streaming data reveal significant heterogeneity and temporal evolution in national moderation patterns. Human evaluation shows 78% agreement on the fidelity of LLM-generated explanations. We publicly release all code and annotated datasets to support methodological advancement and empirical grounding for global platform governance.
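The Shapley-value attribution step mentioned above can be illustrated with a minimal exact computation. Everything below is a hypothetical sketch: the token list and the toy `toy_score` moderation function stand in for the paper's multilingual classifier, and are not from the work itself.

```python
from itertools import combinations
from math import factorial

def shapley_values(tokens, score):
    """Exact Shapley attribution of a moderation score over tokens.

    score(subset_of_indices) -> float: the model's 'moderate' score
    when only those tokens are present.
    """
    n = len(tokens)
    phi = {}
    for i, tok in enumerate(tokens):
        rest = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # subsets of each size drawn from the other tokens
            for S in combinations(rest, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (score(set(S) | {i}) - score(set(S)))
        phi[tok] = total
    return phi

# Hypothetical scoring function: flags posts containing token 0 strongly,
# token 1 weakly, and ignores token 2.
def toy_score(active):
    return 0.7 * (0 in active) + 0.2 * (1 in active)

tokens = ["slur", "protest", "weather"]
phi = shapley_values(tokens, toy_score)
# For an additive score like this one, each token's Shapley value
# recovers exactly its own contribution weight.
```

Exact computation is exponential in the number of tokens; in practice, libraries such as SHAP approximate these values by sampling, but the attribution being estimated is the same quantity.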

📝 Abstract
The ability of Natural Language Processing (NLP) methods to categorize text into multiple classes has motivated their use in online content moderation tasks, such as hate speech and fake news detection. However, there is limited understanding of how or why these methods make such decisions, or why certain content is moderated in the first place. To investigate the hidden mechanisms behind content moderation, we explore multiple directions: 1) training classifiers to reverse-engineer content moderation decisions across countries; 2) explaining content moderation decisions by analyzing Shapley values and LLM-guided explanations. Our primary focus is on content moderation decisions made across countries, using pre-existing corpora sampled from the Twitter Stream Grab. Our experiments reveal interesting patterns in censored posts, both across countries and over time. Through human evaluations of LLM-generated explanations across three LLMs, we assess the effectiveness of using LLMs in content moderation. Finally, we discuss potential future directions, as well as the limitations and ethical considerations of this work. Our code and data are available at https://github.com/causalNLP/censorship
Problem

Research questions and friction points this paper is trying to address.

Understanding hidden mechanisms of cross-country content moderation decisions.
Exploring NLP methods to reverse-engineer and explain moderation outcomes.
Assessing LLM effectiveness in generating explanations for content moderation.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reverse-engineering content moderation decisions across countries.
Analyzing Shapley values to explain moderation decisions.
Evaluating LLM-generated explanations for content moderation.
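In its simplest form, reverse-engineering a country's moderation decisions amounts to fitting a withheld/not-withheld classifier on posts observed in that country. The bag-of-words Naive Bayes below is a minimal stdlib sketch of that idea under assumptions: the toy posts and labels are invented, and this is not the paper's multilingual model or its Twitter Stream Grab data.

```python
from collections import Counter
from math import log

def train_nb(docs, labels):
    """Fit bag-of-words Naive Bayes; label 1 = withheld in this country."""
    counts = {0: Counter(), 1: Counter()}
    prior = Counter(labels)
    for toks, y in zip(docs, labels):
        counts[y].update(toks)
    vocab = set(counts[0]) | set(counts[1])
    return counts, prior, vocab

def predict(model, toks):
    """Return the label with the higher log-posterior (Laplace smoothing)."""
    counts, prior, vocab = model
    best, best_lp = None, float("-inf")
    total = sum(prior.values())
    for y in (0, 1):
        lp = log(prior[y] / total)
        denom = sum(counts[y].values()) + len(vocab)
        for t in toks:
            lp += log((counts[y][t] + 1) / denom)
        if lp > best_lp:
            best, best_lp = y, lp
    return best

# Hypothetical training posts for one country: 1 = withheld, 0 = visible.
docs = [["ban", "protest"], ["cat", "video"],
        ["protest", "march"], ["dog", "video"]]
labels = [1, 0, 1, 0]
model = train_nb(docs, labels)
```

Training one such classifier per country and then comparing them, e.g. via the Shapley attributions of their features, is what surfaces the cross-country heterogeneity the summary describes.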