🤖 AI Summary
Content moderators face significant mental health risks, including emotional contagion and the internalization of bias, due to prolonged exposure to hate speech. To address this, we propose HateBuffer, the first system that jointly anonymizes targeted entities and gently rewrites offensive text, establishing a reversible psychological safeguard: it displays de-identified, less offensive versions of comments by default while allowing on-demand access to the original content. Leveraging NLP techniques, HateBuffer integrates hate speech detection, entity de-identification, and controllable text rewriting to balance moderation accuracy with psychological buffering. Empirical evaluation shows that our approach significantly reduces perceived hate intensity (p < 0.01) and improves recall for hateful samples. Although it does not significantly alleviate emotional fatigue, moderators rate HateBuffer as an effective and trustworthy affective buffer. This work provides a deployable, ethically grounded technical framework for platform-level content moderation design.
📝 Abstract
Hate speech remains a persistent and unresolved challenge on online platforms. Content moderators, working on the front lines to review user-generated content and shield viewers from hate speech, often find themselves unprotected from the mental burden of continuously engaging with offensive language. To safeguard moderators' mental well-being, we designed HateBuffer, which anonymizes the targets of hate speech, paraphrases offensive expressions into less offensive forms, and shows the original expressions when moderators opt to see them. Our user study with 80 participants consisted of a simulated hate speech moderation task set on a fictional news platform, followed by semi-structured interviews. Although participants using HateBuffer rated the hate severity of comments lower, they did not, contrary to our expectations, experience improved emotion or reduced fatigue compared with the control group. In interviews, however, participants described HateBuffer as an effective buffer against emotional contagion and the normalization of the biased opinions in hate speech. Notably, HateBuffer did not compromise moderation accuracy and even contributed to a slight increase in recall. We explore possible explanations for the discrepancy between the perceived benefits of HateBuffer and its measured impact on mental well-being. We also underscore the promise of text-based content modification techniques as tools for a healthier content moderation environment.
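The abstract describes a three-part display flow: de-identify the targets of hate speech, paraphrase offensive expressions into milder forms, and restore the original text only when the moderator opts in. A minimal sketch of that flow is below; the lexicons, class, and function names are illustrative assumptions we introduce here, not the authors' implementation, which uses NLP models rather than word lists.

```python
# Hypothetical sketch of a HateBuffer-style display pipeline.
# TARGET_ENTITIES and SOFTEN_MAP are toy stand-ins for the paper's entity
# de-identification and offensive-expression paraphrasing components.
import re
from dataclasses import dataclass

@dataclass
class BufferedComment:
    original: str           # raw comment, hidden by default
    display: str            # de-identified, softened version shown to moderators
    revealed: bool = False  # whether the moderator opted to see the original

TARGET_ENTITIES = {"alice": "[PERSON]", "acme": "[GROUP]"}  # assumed lexicon
SOFTEN_MAP = {"despicable": "unpleasant", "trash": "bad"}   # assumed lexicon

def buffer_comment(text: str) -> BufferedComment:
    """Produce the de-identified, less offensive version shown by default."""
    softened = text
    for entity, placeholder in TARGET_ENTITIES.items():
        softened = re.sub(entity, placeholder, softened, flags=re.IGNORECASE)
    for word, milder in SOFTEN_MAP.items():
        softened = re.sub(word, milder, softened, flags=re.IGNORECASE)
    return BufferedComment(original=text, display=softened)

def reveal(comment: BufferedComment) -> str:
    """On-demand access to the original expression, as the paper describes."""
    comment.revealed = True
    return comment.original

c = buffer_comment("Alice is despicable trash")
print(c.display)   # → "[PERSON] is unpleasant bad"
print(reveal(c))   # → "Alice is despicable trash"
```

The key design point the sketch captures is reversibility: the softened text is a view, not a replacement, so moderation decisions can always be checked against the original content.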