🤖 AI Summary
Current AI content moderation systems struggle with co-occurring violations—where a single piece of content breaches multiple policies—and with dynamic rules that evolve alongside platform policies, often leading to erroneous removals or missed detections. To address these challenges, this work proposes GMP, the first content moderation benchmark designed for such real-world scenarios, systematically integrating co-occurring violations and dynamic rule enforcement. Targeting large language model (LLM)-based moderation, GMP introduces a comprehensive evaluation framework featuring multi-label violation classification and context-sensitive rule reasoning tasks. Experimental results show that state-of-the-art LLMs perform significantly worse on GMP than on static benchmarks, exposing their fragility in complex, dynamic environments and establishing a new paradigm for evaluating robust content moderation systems.
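The multi-label evaluation mentioned above can be illustrated with a minimal sketch. The label names and the choice of set-based F1 as the metric are assumptions for illustration, not details confirmed by the paper:

```python
# Hypothetical sketch: scoring a multi-label moderation prediction against
# gold labels with set-based precision/recall/F1. Label names and the exact
# metric are illustrative assumptions, not GMP's documented protocol.

def multilabel_f1(predicted: set, gold: set) -> float:
    """Set-based F1 between predicted and gold violation-label sets."""
    if not predicted and not gold:
        return 1.0  # both empty: model correctly flagged nothing
    tp = len(predicted & gold)  # labels the model got right
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# The stereotype-based insult from the abstract violates two policies at
# once; predicting only one of them is penalized rather than counted correct.
gold = {"prejudice", "personal_attack"}
print(multilabel_f1({"prejudice", "personal_attack"}, gold))  # 1.0
print(multilabel_f1({"prejudice"}, gold))                     # ~0.667
```

This is why single-label classifiers look deceptively strong on static benchmarks: a partially correct prediction that would pass a binary "violation / no violation" check still loses credit under a multi-label metric.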
📝 Abstract
Online content moderation is essential for maintaining a healthy digital environment, and reliance on AI for this task continues to grow. Consider a user comment that uses national stereotypes to insult a politician. This example illustrates two critical challenges in real-world scenarios: (1) Co-occurring Violations, where a single post violates multiple policies (e.g., prejudice and personal attacks); (2) Dynamic Moderation Rules, where whether content constitutes a violation depends on platform-specific guidelines that evolve across contexts. The intersection of co-occurring harms and dynamically changing rules highlights a core limitation of current AI systems: although large language models (LLMs) are adept at following fixed guidelines, their judgment degrades when policies are unstable or context-dependent. In practice, such shortcomings lead to inconsistent moderation: either erroneously restricting legitimate expression or allowing harmful content to remain online. This raises a critical question for evaluation: does high performance on existing static benchmarks truly guarantee robust generalization of AI judgment to real-world scenarios involving co-occurring violations and dynamically changing rules?